huggingface · AdilZouitine · Dec 20, 2024 · Dec 23, 2024 · Jan 3, 2025 · Jan 8, 2025
diff --git a/.github/workflows/quality.yml b/.github/workflows/quality.yml
@@ -50,7 +50,7 @@ jobs:
         uses: actions/checkout@v3
 
       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"
 
       - name: Poetry check
         run: poetry check
@@ -64,7 +64,7 @@ jobs:
         uses: actions/checkout@v3
 
       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"
 
       - name: Install poetry-relax
         run: poetry self add poetry-relax

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -17,6 +17,7 @@ repos:
     rev: v3.19.0
     hooks:
     -   id: pyupgrade
+        exclude: '^(.*_pb2_grpc\.py|.*_pb2\.py$)'
   - repo: https://github.com/astral-sh/ruff-pre-commit
     rev: v0.8.2
     hooks:

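The `exclude` value added to the `pyupgrade` hook is a Python regular expression. Assuming pre-commit applies it to each file path with `re.search`, the sketch below checks which paths the pattern would skip. The sample paths and the helper name `is_excluded` are hypothetical. Note that the `$` sits inside the second alternative only, so the first alternative is anchored at the start of the path but not at the end.

```python
import re

# Pattern copied verbatim from the hook configuration.
EXCLUDE = re.compile(r"^(.*_pb2_grpc\.py|.*_pb2\.py$)")


def is_excluded(path: str) -> bool:
    """Return True if the hook would skip this path (assumes re.search semantics)."""
    return EXCLUDE.search(path) is not None


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    samples = [
        "pkg/service_pb2.py",
        "pkg/service_pb2_grpc.py",
        "pkg/service_pb2_grpc.pyi",
        "pkg/service_pb2.pyi",
        "pkg/utils.py",
    ]
    for path in samples:
        print(f"{path}: excluded={is_excluded(path)}")
```

If only `.py` files should be skipped, a form that anchors both alternatives, such as `^.*_pb2(_grpc)?\.py$`, would be one option.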
diff --git a/README.md b/README.md
@@ -68,7 +68,7 @@
 
 ### Acknowledgment
 
-- Thanks to Tony Zaho, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
+- Thanks to Tony Zhao, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
 - Thanks to Cheng Chi, Zhenjia Xu and colleagues for open sourcing Diffusion policy, Pusht environment and datasets, as well as UMI datasets. Ours are adapted from [Diffusion Policy](https://diffusion-policy.cs.columbia.edu) and [UMI Gripper](https://umi-gripper.github.io).
 - Thanks to Nicklas Hansen, Yunhai Feng and colleagues for open sourcing TDMPC policy, Simxarm environments and datasets. Ours are adapted from [TDMPC](https://github.com/nicklashansen/tdmpc) and [FOWM](https://www.yunhaifeng.com/FOWM).
 - Thanks to Antonio Loquercio and Ashish Kumar for their early support.

diff --git a/benchmarks/video/README.md b/benchmarks/video/README.md
@@ -21,7 +21,7 @@ How to decode videos?
 
 ## Variables
 **Image content & size**
-We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an appartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
+We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an apartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
 For these reasons, we run this benchmark on four representative datasets:
 - `lerobot/pusht_image`: (96 x 96 pixels) simulation with simple geometric shapes, fixed camera.
 - `aliberts/aloha_mobile_shrimp_image`: (480 x 640 pixels) real-world indoor, moving camera.
@@ -63,7 +63,7 @@ This of course is affected by the `-g` parameter during encoding, which specifie
 
 Note that this differs significantly from a typical use case like watching a movie, in which every frame is loaded sequentially from the beginning to the end and it's acceptable to have big values for `-g`.
 
-Additionally, because some policies might request single timestamps that are a few frames appart, we also have the following scenario:
+Additionally, because some policies might request single timestamps that are a few frames apart, we also have the following scenario:
 - `2_frames_4_space`: 2 frames with 4 consecutive frames of spacing in between (e.g `[t, t + 5 / fps]`),
 
 However, due to how video decoding is implemented with `pyav`, we don't have access to an accurate seek so in practice this scenario is essentially the same as `6_frames` since all 6 frames between `t` and `t + 5 / fps` will be decoded.
@@ -85,8 +85,8 @@ However, due to how video decoding is implemented with `pyav`, we don't have acc
 **Average Structural Similarity Index Measure (higher is better)**
 `avg_ssim` evaluates the perceived quality of images by comparing luminance, contrast, and structure. SSIM values range from -1 to 1, where 1 indicates perfect similarity.
 
-One aspect that can't be measured here with those metrics is the compatibility of the encoding accross platforms, in particular on web browser, for visualization purposes.
-h264, h265 and AV1 are all commonly used codecs and should not be pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
+One aspect that can't be measured here with those metrics is the compatibility of the encoding across platforms, in particular on web browser, for visualization purposes.
+h264, h265 and AV1 are all commonly used codecs and should not pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
 - `yuv420p` is more widely supported across various platforms, including web browsers.
 - `yuv444p` offers higher color fidelity but might not be supported as broadly.
 
@@ -116,7 +116,7 @@ Additional encoding parameters exist that are not included in this benchmark. In
 - `-preset` which allows for selecting encoding presets. This represents a collection of options that will provide a certain encoding speed to compression ratio. By leaving this parameter unspecified, it is considered to be `medium` for libx264 and libx265 and `8` for libsvtav1.
 - `-tune` which allows to optimize the encoding for certains aspects (e.g. film quality, fast decoding, etc.).
 
-See the documentation mentioned above for more detailled info on these settings and for a more comprehensive list of other parameters.
+See the documentation mentioned above for more detailed info on these settings and for a more comprehensive list of other parameters.
 
 Similarly on the decoding side, other decoders exist but are not implemented in our current benchmark. To name a few:
 - `torchaudio`

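To make the `2_frames_4_space` and `6_frames` scenarios from the benchmark README concrete, here is a minimal sketch of how the requested timestamps relate to a starting frame index. The function name `sketch_timestamps` and its signature are assumptions for illustration and do not reproduce the benchmark's own `sample_timestamps`.

```python
def sketch_timestamps(mode: str, start_idx: int, fps: int) -> list[float]:
    """Return the timestamps (in seconds) requested for a given scenario.

    Hypothetical helper: it follows the README notation [t, t + 5 / fps].
    """
    if mode == "2_frames_4_space":
        # Two frames with 4 unrequested frames in between.
        frame_indexes = [start_idx, start_idx + 5]
    elif mode == "6_frames":
        # Six consecutive frames.
        frame_indexes = list(range(start_idx, start_idx + 6))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [idx / fps for idx in frame_indexes]


if __name__ == "__main__":
    sparse = sketch_timestamps("2_frames_4_space", start_idx=0, fps=10)
    dense = sketch_timestamps("6_frames", start_idx=0, fps=10)
    print(sparse)
    print(dense)
```

Both requests span the same interval from `t` to `t + 5 / fps`, which is why a decoder without accurate seeking ends up decoding all six frames in either case.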
diff --git a/benchmarks/video/run_video_benchmark.py b/benchmarks/video/run_video_benchmark.py
@@ -32,7 +32,11 @@
 import pandas as pd
 import PIL
 import torch
-from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity
+from skimage.metrics import (
+    mean_squared_error,
+    peak_signal_noise_ratio,
+    structural_similarity,
+)
 from tqdm import tqdm
 
 from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
@@ -81,7 +85,9 @@ def get_directory_size(directory: Path) -> int:
     return total_size
 
 
-def load_original_frames(imgs_dir: Path, timestamps: list[float], fps: int) -> torch.Tensor:
+def load_original_frames(
+    imgs_dir: Path, timestamps: list[float], fps: int
+) -> torch.Tensor:
     frames = []
     for ts in timestamps:
         idx = int(ts * fps)
@@ -94,7 +100,11 @@ def load_original_frames(imgs_dir: Path, timestamps: list[float], fps: int) -> t
 
 
 def save_decoded_frames(
-    imgs_dir: Path, save_dir: Path, frames: torch.Tensor, timestamps: list[float], fps: int
+    imgs_dir: Path,
+    save_dir: Path,
+    frames: torch.Tensor,
+    timestamps: list[float],
+    fps: int,
 ) -> None:
     if save_dir.exists() and len(list(save_dir.glob("frame_*.png"))) == len(timestamps):
         return
@@ -104,7 +114,10 @@ def save_decoded_frames(
         idx = int(ts * fps)
         frame_hwc = (frames[i].permute((1, 2, 0)) * 255).type(torch.uint8).cpu().numpy()
         PIL.Image.fromarray(frame_hwc).save(save_dir / f"frame_{idx:06d}_decoded.png")
-        shutil.copyfile(imgs_dir / f"frame_{idx:06d}.png", save_dir / f"frame_{idx:06d}_original.png")
+        shutil.copyfile(
+            imgs_dir / f"frame_{idx:06d}.png",
+            save_dir / f"frame_{idx:06d}_original.png",
+        )
 
 
 def save_first_episode(imgs_dir: Path, dataset: LeRobotDataset) -> None:
@@ -116,11 +129,17 @@ def save_first_episode(imgs_dir: Path, dataset: LeRobotDataset) -> None:
     hf_dataset = dataset.hf_dataset.with_format(None)
 
     # We only save images from the first camera
-    img_keys = [key for key in hf_dataset.features if key.startswith("observation.image")]
+    img_keys = [
+        key for key in hf_dataset.features if key.startswith("observation.image")
+    ]
     imgs_dataset = hf_dataset.select_columns(img_keys[0])
 
     for i, item in enumerate(
-        tqdm(imgs_dataset, desc=f"saving {dataset.repo_id} first episode images", leave=False)
+        tqdm(
+            imgs_dataset,
+            desc=f"saving {dataset.repo_id} first episode images",
+            leave=False,
+        )
     ):
         img = item[img_keys[0]]
         img.save(str(imgs_dir / f"frame_{i:06d}.png"), quality=100)
@@ -129,7 +148,9 @@ def save_first_episode(imgs_dir: Path, dataset: LeRobotDataset) -> None:
             break
 
 
-def sample_timestamps(timestamps_mode: str, ep_num_images: int, fps: int) -> list[float]:
+def sample_timestamps(
+    timestamps_mode: str, ep_num_images: int, fps: int
+) -> list[float]:
     # Start at 5 to allow for 2_frames_4_space and 6_frames
     idx = random.randint(5, ep_num_images - 1)
     match timestamps_mode:
@@ -154,7 +175,9 @@ def decode_video_frames(
     backend: str,
 ) -> torch.Tensor:
     if backend in ["pyav", "video_reader"]:
-        return decode_video_frames_torchvision(video_path, timestamps, tolerance_s, backend)
+        return decode_video_frames_torchvision(
+            video_path, timestamps, tolerance_s, backend
+        )
     else:
         raise NotImplementedError(backend)
 
@@ -181,7 +204,9 @@ def process_sample(sample: int):
         }
 
         with time_benchmark:
-            frames = decode_video_frames(video_path, timestamps=timestamps, tolerance_s=5e-1, backend=backend)
+            frames = decode_video_frames(
+                video_path, timestamps=timestamps, tolerance_s=5e-1, backend=backend
+            )
         result["load_time_video_ms"] = time_benchmark.result_ms / num_frames
 
         with time_benchmark:
@@ -190,12 +215,18 @@ def process_sample(sample: int):
 
         frames_np, original_frames_np = frames.numpy(), original_frames.numpy()
         for i in range(num_frames):
-            result["mse_values"].append(mean_squared_error(original_frames_np[i], frames_np[i]))
+            result["mse_values"].append(
+                mean_squared_error(original_frames_np[i], frames_np[i])
+            )
             result["psnr_values"].append(
-                peak_signal_noise_ratio(original_frames_np[i], frames_np[i], data_range=1.0)
+                peak_signal_noise_ratio(
+                    original_frames_np[i], frames_np[i], data_range=1.0
+                )
             )
             result["ssim_values"].append(
-                structural_similarity(original_frames_np[i], frames_np[i], data_range=1.0, channel_axis=0)
+                structural_similarity(
+                    original_frames_np[i], frames_np[i], data_range=1.0, channel_axis=0
+                )
             )
 
         if save_frames and sample == 0:
@@ -215,7 +246,9 @@ def process_sample(sample: int):
     # As these samples are independent, we run them in parallel threads to speed up the benchmark.
     with ThreadPoolExecutor(max_workers=num_workers) as executor:
         futures = [executor.submit(process_sample, i) for i in range(num_samples)]
-        for future in tqdm(as_completed(futures), total=num_samples, desc="samples", leave=False):
+        for future in tqdm(
+            as_completed(futures), total=num_samples, desc="samples", leave=False
+        ):
             result = future.result()
             load_times_video_ms.append(result["load_time_video_ms"])
             load_times_images_ms.append(result["load_time_images_ms"])
@@ -275,9 +308,13 @@ def benchmark_encoding_decoding(
     random.seed(seed)
     benchmark_table = []
     for timestamps_mode in tqdm(
-        decoding_cfg["timestamps_modes"], desc="decodings (timestamps_modes)", leave=False
+        decoding_cfg["timestamps_modes"],
+        desc="decodings (timestamps_modes)",
+        leave=False,
     ):
-        for backend in tqdm(decoding_cfg["backends"], desc="decodings (backends)", leave=False):
+        for backend in tqdm(
+            decoding_cfg["backends"], desc="decodings (backends)", leave=False
+        ):
             benchmark_row = benchmark_decoding(
                 imgs_dir,
                 video_path,
@@ -355,14 +392,23 @@ def main(
                 imgs_dir = output_dir / "images" / dataset.repo_id.replace("/", "_")
                 # We only use the first episode
                 save_first_episode(imgs_dir, dataset)
-                for key, values in tqdm(encoding_benchmarks.items(), desc="encodings (g, crf)", leave=False):
+                for key, values in tqdm(
+                    encoding_benchmarks.items(), desc="encodings (g, crf)", leave=False
+                ):
                     for value in tqdm(values, desc=f"encodings ({key})", leave=False):
                         encoding_cfg = BASE_ENCODING.copy()
                         encoding_cfg["vcodec"] = video_codec
                         encoding_cfg["pix_fmt"] = pixel_format
                         encoding_cfg[key] = value
-                        args_path = Path("_".join(str(value) for value in encoding_cfg.values()))
-                        video_path = output_dir / "videos" / args_path / f"{repo_id.replace('/', '_')}.mp4"
+                        args_path = Path(
+                            "_".join(str(value) for value in encoding_cfg.values())
+                        )
+                        video_path = (
+                            output_dir
+                            / "videos"
+                            / args_path
+                            / f"{repo_id.replace('/', '_')}.mp4"
+                        )
                         benchmark_table += benchmark_encoding_decoding(
                             dataset,
                             video_path,
@@ -388,7 +434,9 @@ def main(
     # Concatenate all results
     df_list = [pd.read_csv(csv_path) for csv_path in file_paths]
     concatenated_df = pd.concat(df_list, ignore_index=True)
-    concatenated_path = output_dir / f"{now:%Y-%m-%d}_{now:%H-%M-%S}_all_{num_samples}-samples.csv"
+    concatenated_path = (
+        output_dir / f"{now:%Y-%m-%d}_{now:%H-%M-%S}_all_{num_samples}-samples.csv"
+    )
     concatenated_df.to_csv(concatenated_path, header=True, index=False)
 
 

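The script passes `data_range=1.0` to the quality metrics, which suggests the frames are compared as floats scaled to `[0, 1]`. As a reference point, the textbook definitions of MSE and PSNR can be sketched in plain Python. The helper names `mse` and `psnr` are hypothetical, and this sketch is not a replacement for the `skimage.metrics` functions the script actually calls.

```python
import math


def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equally sized, flattened frames."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a: list[float], b: list[float], data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    err = mse(a, b)
    if err == 0:
        # Identical inputs: the ratio is unbounded.
        return math.inf
    return 10 * math.log10(data_range**2 / err)


if __name__ == "__main__":
    original = [0.0, 0.0, 0.0, 0.0]
    decoded = [0.1, 0.1, 0.1, 0.1]
    print(f"mse={mse(original, decoded):.4f}")
    print(f"psnr={psnr(original, decoded):.2f} dB")
```

With `data_range=1.0`, a uniform error of 0.1 per pixel gives an MSE of about 0.01 and a PSNR of about 20 dB.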
diff --git a/checkport.py b/checkport.py
@@ -0,0 +1,18 @@
+import socket
+
+
+def check_port(host, port):
+    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        s.connect((host, port))
+        print(f"Connection successful to {host}:{port}!")
+    except Exception as e:
+        print(f"Connection failed to {host}:{port}: {e}")
+    finally:
+        s.close()
+
+
+if __name__ == "__main__":
+    host = "127.0.0.1"  # or "localhost"
+    port = 51350
+    check_port(host, port)
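A variant of the port check that returns a boolean and applies a timeout may be easier to reuse from other scripts. This is a sketch and not part of the change: the function name `is_port_open` and the 1-second default timeout are assumptions. It relies only on `socket.create_connection` from the standard library.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # The context manager closes the socket on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Same host and port as in checkport.py.
    print(is_port_open("127.0.0.1", 51350))
```

Returning a value instead of printing lets a caller poll the port in a loop or assert on it in a test.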
diff --git a/docker/lerobot-gpu-mani-skill/Dockerfile b/docker/lerobot-gpu-mani-skill/Dockerfile
@@ -0,0 +1,11 @@
+FROM huggingface/lerobot-gpu:latest
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    libvulkan1 vulkan-tools \
+    && apt-get clean && rm -rf /var/lib/apt/lists/*
+
+RUN pip install --upgrade --no-cache-dir pip
+RUN pip install --no-cache-dir ".[mani-skill]"
+
+# Set EGL as the rendering backend for MuJoCo
+ENV MUJOCO_GL="egl"