
Commit cab56c5

aliberts authored and Cadene committed
Dataset v2.0 (huggingface#461)
Co-authored-by: Remi <remi.cadene@huggingface.co>
1 parent 1568b6d commit cab56c5


71 files changed: 6,109 additions, 2,229 deletions

.github/PULL_REQUEST_TEMPLATE.md

+1-1
@@ -21,7 +21,7 @@ Provide a simple way for the reviewer to try out your changes.

 Examples:
 ```bash
-DATA_DIR=tests/data pytest -sx tests/test_stuff.py::test_something
+pytest -sx tests/test_stuff.py::test_something
 ```
 ```bash
 python lerobot/scripts/train.py --some.option=true

.github/workflows/nightly-tests.yml

+1-7
@@ -7,10 +7,8 @@ on:
   schedule:
     - cron: "0 2 * * *"

-env:
-  DATA_DIR: tests/data
+# env:
 #   SLACK_API_TOKEN: ${{ secrets.SLACK_API_TOKEN }}
-
 jobs:
   run_all_tests_cpu:
     name: CPU
@@ -30,13 +28,9 @@ jobs:
     working-directory: /lerobot
     steps:
       - name: Tests
-        env:
-          DATA_DIR: tests/data
         run: pytest -v --cov=./lerobot --disable-warnings tests

       - name: Tests end-to-end
-        env:
-          DATA_DIR: tests/data
         run: make test-end-to-end

.github/workflows/test.yml

+1-4
@@ -29,7 +29,6 @@ jobs:
     name: Pytest
     runs-on: ubuntu-latest
     env:
-      DATA_DIR: tests/data
       MUJOCO_GL: egl
     steps:
       - uses: actions/checkout@v4
@@ -70,7 +69,6 @@ jobs:
     name: Pytest (minimal install)
     runs-on: ubuntu-latest
     env:
-      DATA_DIR: tests/data
       MUJOCO_GL: egl
     steps:
       - uses: actions/checkout@v4
@@ -103,12 +101,11 @@ jobs:
           -W ignore::UserWarning:gymnasium.utils.env_checker:247 \
           && rm -rf tests/outputs outputs

-
+  # TODO(aliberts, rcadene): redesign after v2 migration / removing hydra
   end-to-end:
     name: End-to-end
     runs-on: ubuntu-latest
     env:
-      DATA_DIR: tests/data
       MUJOCO_GL: egl
     steps:
       - uses: actions/checkout@v4

CONTRIBUTING.md

+1-1
@@ -267,7 +267,7 @@ We use `pytest` in order to run the tests. From the root of the
 repository, here's how to run tests with `pytest` for the library:

 ```bash
-DATA_DIR="tests/data" python -m pytest -sv ./tests
+python -m pytest -sv ./tests
 ```

README.md

+7-7
@@ -153,10 +153,12 @@ python lerobot/scripts/visualize_dataset.py \
     --episode-index 0
 ```

-or from a dataset in a local folder with the root `DATA_DIR` environment variable (in the following case the dataset will be searched for in `./my_local_data_dir/lerobot/pusht`)
+or from a dataset in a local folder with the `--root` option and the `--local-files-only` flag (in the following case the dataset will be searched for in `./my_local_data_dir/lerobot/pusht`)
 ```bash
-DATA_DIR='./my_local_data_dir' python lerobot/scripts/visualize_dataset.py \
+python lerobot/scripts/visualize_dataset.py \
     --repo-id lerobot/pusht \
+    --root ./my_local_data_dir \
+    --local-files-only 1 \
     --episode-index 0
 ```

@@ -208,12 +210,10 @@ dataset attributes:

 A `LeRobotDataset` is serialised using several widespread file formats for each of its parts, namely:
 - hf_dataset stored using Hugging Face datasets library serialization to parquet
-- videos are stored in mp4 format to save space or png files
-- episode_data_index saved using `safetensor` tensor serialization format
-- stats saved using `safetensor` tensor serialization format
-- info are saved using JSON
+- videos are stored in mp4 format to save space
+- metadata are stored in plain json/jsonl files

-Dataset can be uploaded/downloaded from the HuggingFace hub seamlessly. To work on a local dataset, you can set the `DATA_DIR` environment variable to your root dataset folder as illustrated in the above section on dataset visualization.
+Datasets can be uploaded/downloaded from the HuggingFace hub seamlessly. To work on a local dataset, you can use the `local_files_only` argument and specify its location with the `root` argument if it's not in the default `~/.cache/huggingface/lerobot` location.

 ### Evaluate a pretrained policy
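The same switch applies on the Python side; a minimal sketch of loading that local dataset directly with `LeRobotDataset` (the `root` layout below is an assumption inferred from the CLI example above, so adjust the path to where your copy actually lives):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Sketch, not canonical docs: `root` points at the local copy of the dataset
# (path assumed from the CLI example above) and `local_files_only` skips any
# hub download. Without `root`, ~/.cache/huggingface/lerobot is used.
dataset = LeRobotDataset(
    "lerobot/pusht",
    root="./my_local_data_dir/lerobot/pusht",
    local_files_only=True,
)
print(dataset.num_episodes, dataset.num_frames)
```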

benchmarks/video/run_video_benchmark.py

+1-1
@@ -266,7 +266,7 @@ def benchmark_encoding_decoding(
     )

     ep_num_images = dataset.episode_data_index["to"][0].item()
-    width, height = tuple(dataset[0][dataset.camera_keys[0]].shape[-2:])
+    width, height = tuple(dataset[0][dataset.meta.camera_keys[0]].shape[-2:])
     num_pixels = width * height
     video_size_bytes = video_path.stat().st_size
     images_size_bytes = get_directory_size(imgs_dir)
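This one-line change is the recurring v2.0 pattern in this commit: dataset-level attributes such as `camera_keys` now live on `dataset.meta`. A sketch of the access it assumes (note that `shape[-2:]` of a channel-first tensor is `(h, w)`, so the `width, height` ordering in the benchmark line is worth double-checking):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/pusht")

# v2.0: camera keys moved from the dataset onto its metadata object.
camera_key = dataset.meta.camera_keys[0]
frame = dataset[0][camera_key]  # torch.Tensor, channel-first (c, h, w)
height, width = frame.shape[-2:]
print(camera_key, height, width)
```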

examples/10_use_so100.md

+2-7
@@ -192,7 +192,6 @@ Record 2 episodes and upload your dataset to the hub:
 python lerobot/scripts/control_robot.py record \
     --robot-path lerobot/configs/robot/so100.yaml \
     --fps 30 \
-    --root data \
     --repo-id ${HF_USER}/so100_test \
     --tags so100 tutorial \
     --warmup-time-s 5 \
@@ -212,18 +211,16 @@ echo ${HF_USER}/so100_test
 If you didn't upload with `--push-to-hub 0`, you can also visualize it locally with:
 ```bash
 python lerobot/scripts/visualize_dataset_html.py \
-    --root data \
     --repo-id ${HF_USER}/so100_test
 ```

 ## Replay an episode

 Now try to replay the first episode on your robot:
 ```bash
-DATA_DIR=data python lerobot/scripts/control_robot.py replay \
+python lerobot/scripts/control_robot.py replay \
     --robot-path lerobot/configs/robot/so100.yaml \
     --fps 30 \
-    --root data \
     --repo-id ${HF_USER}/so100_test \
     --episode 0
 ```
@@ -232,7 +229,7 @@ DATA_DIR=data python lerobot/scripts/control_robot.py replay \

 To train a policy to control your robot, use the [`python lerobot/scripts/train.py`](../lerobot/scripts/train.py) script. A few arguments are required. Here is an example command:
 ```bash
-DATA_DIR=data python lerobot/scripts/train.py \
+python lerobot/scripts/train.py \
     dataset_repo_id=${HF_USER}/so100_test \
     policy=act_so100_real \
     env=so100_real \
@@ -248,7 +245,6 @@ Let's explain it:
 3. We provided an environment as argument with `env=so100_real`. This loads configurations from [`lerobot/configs/env/so100_real.yaml`](../lerobot/configs/env/so100_real.yaml).
 4. We provided `device=cuda` since we are training on a Nvidia GPU, but you can also use `device=mps` if you are using a Mac with Apple silicon, or `device=cpu` otherwise.
 5. We provided `wandb.enable=true` to use [Weights and Biases](https://docs.wandb.ai/quickstart) for visualizing training plots. This is optional but if you use it, make sure you are logged in by running `wandb login`.
-6. We added `DATA_DIR=data` to access your dataset stored in your local `data` directory. If you dont provide `DATA_DIR`, your dataset will be downloaded from Hugging Face hub to your cache folder `$HOME/.cache/hugginface`. In future versions of `lerobot`, both directories will be in sync.

 Training should take several hours. You will find checkpoints in `outputs/train/act_so100_test/checkpoints`.

@@ -259,7 +255,6 @@ You can use the `record` function from [`lerobot/scripts/control_robot.py`](../l
 python lerobot/scripts/control_robot.py record \
     --robot-path lerobot/configs/robot/so100.yaml \
     --fps 30 \
-    --root data \
     --repo-id ${HF_USER}/eval_act_so100_test \
     --tags so100 tutorial eval \
     --warmup-time-s 5 \
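With `--root data` and `DATA_DIR=data` removed, the recorded dataset is expected to resolve from the default cache (`~/.cache/huggingface/lerobot`) or from the hub; a hypothetical quick check in Python (`your-hf-username` stands in for `${HF_USER}`):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

hf_user = "your-hf-username"  # stand-in for ${HF_USER} in the shell commands above
# Loads from the local cache if present, otherwise downloads from the hub.
dataset = LeRobotDataset(f"{hf_user}/so100_test")
print(dataset.num_episodes, dataset.num_frames)
```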

examples/11_use_moss.md

+2-7
@@ -192,7 +192,6 @@ Record 2 episodes and upload your dataset to the hub:
 python lerobot/scripts/control_robot.py record \
     --robot-path lerobot/configs/robot/moss.yaml \
     --fps 30 \
-    --root data \
     --repo-id ${HF_USER}/moss_test \
     --tags moss tutorial \
     --warmup-time-s 5 \
@@ -212,18 +211,16 @@ echo ${HF_USER}/moss_test
 If you didn't upload with `--push-to-hub 0`, you can also visualize it locally with:
 ```bash
 python lerobot/scripts/visualize_dataset_html.py \
-    --root data \
     --repo-id ${HF_USER}/moss_test
 ```

 ## Replay an episode

 Now try to replay the first episode on your robot:
 ```bash
-DATA_DIR=data python lerobot/scripts/control_robot.py replay \
+python lerobot/scripts/control_robot.py replay \
     --robot-path lerobot/configs/robot/moss.yaml \
     --fps 30 \
-    --root data \
     --repo-id ${HF_USER}/moss_test \
     --episode 0
 ```
@@ -232,7 +229,7 @@ DATA_DIR=data python lerobot/scripts/control_robot.py replay \

 To train a policy to control your robot, use the [`python lerobot/scripts/train.py`](../lerobot/scripts/train.py) script. A few arguments are required. Here is an example command:
 ```bash
-DATA_DIR=data python lerobot/scripts/train.py \
+python lerobot/scripts/train.py \
     dataset_repo_id=${HF_USER}/moss_test \
     policy=act_moss_real \
     env=moss_real \
@@ -248,7 +245,6 @@ Let's explain it:
 3. We provided an environment as argument with `env=moss_real`. This loads configurations from [`lerobot/configs/env/moss_real.yaml`](../lerobot/configs/env/moss_real.yaml).
 4. We provided `device=cuda` since we are training on a Nvidia GPU, but you can also use `device=mps` if you are using a Mac with Apple silicon, or `device=cpu` otherwise.
 5. We provided `wandb.enable=true` to use [Weights and Biases](https://docs.wandb.ai/quickstart) for visualizing training plots. This is optional but if you use it, make sure you are logged in by running `wandb login`.
-6. We added `DATA_DIR=data` to access your dataset stored in your local `data` directory. If you dont provide `DATA_DIR`, your dataset will be downloaded from Hugging Face hub to your cache folder `$HOME/.cache/hugginface`. In future versions of `lerobot`, both directories will be in sync.

 Training should take several hours. You will find checkpoints in `outputs/train/act_moss_test/checkpoints`.

@@ -259,7 +255,6 @@ You can use the `record` function from [`lerobot/scripts/control_robot.py`](../l
 python lerobot/scripts/control_robot.py record \
     --robot-path lerobot/configs/robot/moss.yaml \
     --fps 30 \
-    --root data \
     --repo-id ${HF_USER}/eval_act_moss_test \
     --tags moss tutorial eval \
     --warmup-time-s 5 \

examples/1_load_lerobot_dataset.py

+83-40
@@ -3,78 +3,120 @@
 It illustrates how to load datasets, manipulate them, and apply transformations suitable for machine learning tasks in PyTorch.

 Features included in this script:
-- Loading a dataset and accessing its properties.
-- Filtering data by episode number.
-- Converting tensor data for visualization.
-- Saving video files from dataset frames.
+- Viewing a dataset's metadata and exploring its properties.
+- Loading an existing dataset from the hub or a subset of it.
+- Accessing frames by episode number.
 - Using advanced dataset features like timestamp-based frame selection.
 - Demonstrating compatibility with PyTorch DataLoader for batch processing.

 The script ends with examples of how to batch process data using PyTorch's DataLoader.
 """

-from pathlib import Path
 from pprint import pprint

-import imageio
 import torch
+from huggingface_hub import HfApi

 import lerobot
-from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
+from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata

+# We ported a number of existing datasets ourselves, use this to see the list:
 print("List of available datasets:")
 pprint(lerobot.available_datasets)

-# Let's take one for this example
-repo_id = "lerobot/pusht"
-
-# You can easily load a dataset from a Hugging Face repository
+# You can also browse through the datasets created/ported by the community on the hub using the hub api:
+hub_api = HfApi()
+repo_ids = [info.id for info in hub_api.list_datasets(task_categories="robotics", tags=["LeRobot"])]
+pprint(repo_ids)
+
+# Or simply explore them in your web browser directly at:
+# https://huggingface.co/datasets?other=LeRobot
+
+# Let's take this one for this example
+repo_id = "lerobot/aloha_mobile_cabinet"
+# We can have a look and fetch its metadata to know more about it:
+ds_meta = LeRobotDatasetMetadata(repo_id)
+
+# By instantiating just this class, you can quickly access useful information about the content and the
+# structure of the dataset without downloading the actual data yet (only metadata files, which are
+# lightweight).
+print(f"Total number of episodes: {ds_meta.total_episodes}")
+print(f"Average number of frames per episode: {ds_meta.total_frames / ds_meta.total_episodes:.3f}")
+print(f"Frames per second used during data collection: {ds_meta.fps}")
+print(f"Robot type: {ds_meta.robot_type}")
+print(f"keys to access images from cameras: {ds_meta.camera_keys=}\n")
+
+print("Tasks:")
+print(ds_meta.tasks)
+print("Features:")
+pprint(ds_meta.features)
+
+# You can also get a short summary by simply printing the object:
+print(ds_meta)
+
+# You can then load the actual dataset from the hub.
+# Either load any subset of episodes:
+dataset = LeRobotDataset(repo_id, episodes=[0, 10, 11, 23])
+
+# And see how many frames you have:
+print(f"Selected episodes: {dataset.episodes}")
+print(f"Number of episodes selected: {dataset.num_episodes}")
+print(f"Number of frames selected: {dataset.num_frames}")
+
+# Or simply load the entire dataset:
 dataset = LeRobotDataset(repo_id)
+print(f"Number of episodes selected: {dataset.num_episodes}")
+print(f"Number of frames selected: {dataset.num_frames}")

-# LeRobotDataset is actually a thin wrapper around an underlying Hugging Face dataset
-# (see https://huggingface.co/docs/datasets/index for more information).
-print(dataset)
-print(dataset.hf_dataset)
+# The previous metadata class is contained in the 'meta' attribute of the dataset:
+print(dataset.meta)

-# And provides additional utilities for robotics and compatibility with Pytorch
-print(f"\naverage number of frames per episode: {dataset.num_samples / dataset.num_episodes:.3f}")
-print(f"frames per second used during data collection: {dataset.fps=}")
-print(f"keys to access images from cameras: {dataset.camera_keys=}\n")
+# LeRobotDataset actually wraps an underlying Hugging Face dataset
+# (see https://huggingface.co/docs/datasets for more information).
+print(dataset.hf_dataset)

-# Access frame indexes associated to first episode
+# LeRobot datasets also subclass PyTorch datasets, so you can do everything you know and love from working
+# with the latter, like iterating through the dataset.
+# The __getitem__ iterates over the frames of the dataset. Since our datasets are also structured by
+# episodes, you can access the frame indices of any episode using the episode_data_index. Here, we access
+# frame indices associated to the first episode:
 episode_index = 0
 from_idx = dataset.episode_data_index["from"][episode_index].item()
 to_idx = dataset.episode_data_index["to"][episode_index].item()

-# LeRobot datasets actually subclass PyTorch datasets so you can do everything you know and love from working
-# with the latter, like iterating through the dataset. Here we grab all the image frames.
-frames = [dataset[idx]["observation.image"] for idx in range(from_idx, to_idx)]
+# Then we grab all the image frames from the first camera:
+camera_key = dataset.meta.camera_keys[0]
+frames = [dataset[idx][camera_key] for idx in range(from_idx, to_idx)]

-# Video frames are now float32 in range [0,1] channel first (c,h,w) to follow pytorch convention. To visualize
-# them, we convert to uint8 in range [0,255]
-frames = [(frame * 255).type(torch.uint8) for frame in frames]
-# and to channel last (h,w,c).
-frames = [frame.permute((1, 2, 0)).numpy() for frame in frames]
+# The objects returned by the dataset are all torch.Tensors
+print(type(frames[0]))
+print(frames[0].shape)

-# Finally, we save the frames to a mp4 video for visualization.
-Path("outputs/examples/1_load_lerobot_dataset").mkdir(parents=True, exist_ok=True)
-imageio.mimsave("outputs/examples/1_load_lerobot_dataset/episode_0.mp4", frames, fps=dataset.fps)
+# Since we're using pytorch, the shape is in pytorch, channel-first convention (c, h, w).
+# We can compare this shape with the information available for that feature
+pprint(dataset.features[camera_key])
+# In particular:
+print(dataset.features[camera_key]["shape"])
+# The shape is in (h, w, c) which is a more universal format.

 # For many machine learning applications we need to load the history of past observations or trajectories of
 # future actions. Our datasets can load previous and future frames for each key/modality, using timestamps
 # differences with the current loaded frame. For instance:
 delta_timestamps = {
     # loads 4 images: 1 second before current frame, 500 ms before, 200 ms before, and current frame
-    "observation.image": [-1, -0.5, -0.20, 0],
-    # loads 8 state vectors: 1.5 seconds before, 1 second before, ... 20 ms, 10 ms, and current frame
-    "observation.state": [-1.5, -1, -0.5, -0.20, -0.10, -0.02, -0.01, 0],
+    camera_key: [-1, -0.5, -0.20, 0],
+    # loads 6 state vectors: 1.5 seconds before, 1 second before, ... 200 ms, 100 ms, and current frame
+    "observation.state": [-1.5, -1, -0.5, -0.20, -0.10, 0],
     # loads 64 action vectors: current frame, 1 frame in the future, 2 frames, ... 63 frames in the future
     "action": [t / dataset.fps for t in range(64)],
 }
+# Note that in any case, these delta_timestamps values need to be multiples of (1/fps) so that added to any
+# timestamp, you still get a valid timestamp.
+
 dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)
-print(f"\n{dataset[0]['observation.image'].shape=}")  # (4,c,h,w)
-print(f"{dataset[0]['observation.state'].shape=}")  # (8,c)
-print(f"{dataset[0]['action'].shape=}\n")  # (64,c)
+print(f"\n{dataset[0][camera_key].shape=}")  # (4, c, h, w)
+print(f"{dataset[0]['observation.state'].shape=}")  # (6, c)
+print(f"{dataset[0]['action'].shape=}\n")  # (64, c)

 # Finally, our datasets are fully compatible with PyTorch dataloaders and samplers because they are just
 # PyTorch datasets.
@@ -84,8 +126,9 @@
     batch_size=32,
     shuffle=True,
 )
+
 for batch in dataloader:
-    print(f"{batch['observation.image'].shape=}")  # (32,4,c,h,w)
-    print(f"{batch['observation.state'].shape=}")  # (32,8,c)
-    print(f"{batch['action'].shape=}")  # (32,64,c)
+    print(f"{batch[camera_key].shape=}")  # (32, 4, c, h, w)
+    print(f"{batch['observation.state'].shape=}")  # (32, 6, c)
+    print(f"{batch['action'].shape=}")  # (32, 64, c)
     break
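The "multiples of (1/fps)" constraint on `delta_timestamps` noted above is easy to violate with hand-written offsets; a small standalone sketch (plain Python; `is_on_frame_grid` is a hypothetical helper, not lerobot API) that validates a candidate list:

```python
fps = 50  # e.g. the aloha datasets are recorded at 50 fps

def is_on_frame_grid(t: float, fps: int, tol: float = 1e-6) -> bool:
    # Valid offsets are an integer number of frame periods: t == k * (1 / fps).
    k = round(t * fps)
    return abs(t - k / fps) < tol

offsets = [-1.5, -1, -0.5, -0.20, -0.10, 0]
print(all(is_on_frame_grid(t, fps) for t in offsets))  # True at fps=50
```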

examples/3_train_policy.py

+1-1
@@ -40,7 +40,7 @@
 # For this example, no arguments need to be passed because the defaults are set up for PushT.
 # If you're doing something different, you will likely need to change at least some of the defaults.
 cfg = DiffusionConfig()
-policy = DiffusionPolicy(cfg, dataset_stats=dataset.stats)
+policy = DiffusionPolicy(cfg, dataset_stats=dataset.meta.stats)
 policy.train()
 policy.to(device)
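The same `meta` migration, here for normalization stats; a sketch of the full v2.0 call site (the import paths are assumptions based on this example file's subject, as they are not shown in the hunk):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy

dataset = LeRobotDataset("lerobot/pusht")
cfg = DiffusionConfig()
# v2.0: per-feature normalization stats come from dataset.meta.stats.
policy = DiffusionPolicy(cfg, dataset_stats=dataset.meta.stats)
policy.train()
```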
