OpenX/RLDS to LerobotDataset v2 #747

Closed
Tavish9 wants to merge 3 commits

Tavish9 commented Feb 18, 2025

What this does

This PR adds a script for converting datasets from the OpenX/RLDS format to the LeRobotDataset v2.0 format.

Title Label
OpenX / RLDS → LeRobot v2.0 (🗃️ Dataset)

How it was tested

Examples:

    python examples/port_datasets/openx_rlds.py \
        --raw-dir /path/to/bridge_orig/1.0.0 \
        --local-dir /path/to/local_dir \
        --repo-id your_id \
        --use-videos \
        --push-to-hub

Datasets Availability

The converted datasets are now available on the Hugging Face Hub 🤗.

Minimal Code Repo

The conversion code is also available on its own as openx2lerobot. Install lerobot and openx2lerobot to convert your own datasets.

Cadene (Collaborator) commented Feb 19, 2025

Beautiful. Let me try it ;)

Cadene (Collaborator) commented Feb 19, 2025

The data looks good:
https://huggingface.co/spaces/lerobot/visualize_dataset?dataset=cadene%2Fdroid&episode=1

[Screenshot 2025-02-19 at 18 56 43]

However, processing the 92,233 Droid episodes takes 7 days, so I am updating this code to parallelize across nodes.
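
As a rough back-of-the-envelope check on the figures quoted above, the sketch below derives the implied per-episode cost and the wall-clock time under ideal linear scaling. The worker counts are hypothetical, and the estimate ignores any overhead from merging results afterwards.

```python
EPISODES = 92_233      # episode count quoted in the comment
SERIAL_DAYS = 7        # single-node processing time quoted in the comment
SECONDS_PER_DAY = 24 * 3600


def seconds_per_episode(episodes: int, days: float) -> float:
    """Average processing time per episode implied by a serial run."""
    return days * SECONDS_PER_DAY / episodes


def estimated_days(episodes: int, sec_per_episode: float, num_workers: int) -> float:
    """Wall-clock days assuming perfect linear scaling (an optimistic assumption)."""
    return episodes * sec_per_episode / num_workers / SECONDS_PER_DAY


sec = seconds_per_episode(EPISODES, SERIAL_DAYS)

if __name__ == "__main__":
    print(f"about {sec:.1f} s per episode")
    for workers in (1, 8, 64):  # hypothetical worker counts
        print(workers, "workers:", round(estimated_days(EPISODES, sec, workers), 2), "days")
```

That works out to roughly 6.6 seconds per episode, so the work is a good candidate for splitting by episode.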

Tavish9 (Author) commented Feb 20, 2025

Hi @Cadene, many thanks for testing it hands-on.

I think parallelization across nodes should build on the functionality of LeRobotDataset. As far as I know, tfds does not currently support multi-node reading, but we can specify which episodes each rank reads. Another issue is that LeRobotDataset's add_frame method is designed for single-node use.

A potential solution would be to add an identity key, such as “episode_id”, to the episode_buffer.
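
A minimal sketch of the two ideas in this comment, assuming a simple round-robin split: each rank computes its own list of episode indices, and every buffer carries an "episode_id" so frames written by different ranks can be told apart later. The helper names and the dict layout are hypothetical stand-ins and are not the actual LeRobotDataset episode buffer or any tfds API.

```python
def episodes_for_rank(num_episodes: int, rank: int, world_size: int) -> list[int]:
    """Return the episode indices assigned to `rank` (round-robin assignment)."""
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_episodes, world_size))


def new_episode_buffer(episode_id: int) -> dict:
    """Hypothetical buffer: a plain dict tagged with the source episode id."""
    return {"episode_id": episode_id, "frames": []}


if __name__ == "__main__":
    # Each rank would run this with its own `rank` value.
    for episode_id in episodes_for_rank(num_episodes=10, rank=1, world_size=4):
        buffer = new_episode_buffer(episode_id)
        buffer["frames"].append({"step": 0})  # placeholder frame
        print(buffer["episode_id"], len(buffer["frames"]))
```

Round-robin keeps the per-rank load balanced when episode lengths vary without any trend; a contiguous split would work equally well if the reader prefers sequential access.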

Cadene (Collaborator) commented Feb 20, 2025

#758

I am thinking of using datatrove to parallelize over Slurm and create n LeRobotDataset instances, one per shard.
Then I will aggregate them with a new function that I will write tomorrow.

https://github.com/huggingface/lerobot/pull/758/files#diff-3bab29f41f975edaae832d8234d23b2032963427b989151c057735f7b842a5b5
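
The shard-then-aggregate plan can be illustrated with plain Python. This is only a sketch of the general pattern, not the code in #758: the function names are hypothetical, a "dataset" is modelled as a list of dicts, and no datatrove, Slurm, or LeRobotDataset calls are used. Each shard numbers its episodes from 0, and the aggregation step shifts those indices by a running offset.

```python
def make_shards(num_episodes: int, num_shards: int) -> list[range]:
    """Split episode indices into contiguous, near-equal shards."""
    base, extra = divmod(num_episodes, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


def convert_shard(source_ids: range) -> list[dict]:
    """Stand-in for converting one shard; episode indices are local to the shard."""
    return [
        {"episode_index": local, "source_episode": src}
        for local, src in enumerate(source_ids)
    ]


def aggregate(shard_datasets: list[list[dict]]) -> list[dict]:
    """Merge shard outputs, re-indexing episodes so indices are globally unique."""
    merged, offset = [], 0
    for dataset in shard_datasets:
        for episode in dataset:
            merged.append({**episode, "episode_index": episode["episode_index"] + offset})
        offset += len(dataset)
    return merged


if __name__ == "__main__":
    shards = make_shards(num_episodes=10, num_shards=3)
    merged = aggregate([convert_shard(s) for s in shards])
    print([e["episode_index"] for e in merged])
```

In a real run each `convert_shard` call would execute as a separate job, and only `aggregate` would need to see all the outputs.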

imstevenpmwork added the enhancement (Suggestions for new features or improvements) and dataset (Issues regarding data inputs, processing, or datasets) labels Mar 4, 2025
Tavish9 closed this by deleting the head repository Mar 7, 2025