archivist special remote: add support for tar archives with .tgz extension #517

Closed
loj opened this issue Oct 26, 2023 · 1 comment · Fixed by #518

Comments

@loj

loj commented Oct 26, 2023

I'm building a dataset from .tgz archives using the replacement for add-archive-content demonstrated here, in combination with the archivist special remote. The demo below works if the archive has a .tar.gz extension, but not with .tgz. With .tgz, I need to set datalad.archivist.legacy-mode=yes for datalad get to succeed. Here's a quick demo:

% mkdir project
% touch project/file1.txt project/file2.txt project/file3.txt
% tar -czvf project.tgz project
% datalad create tmp && cd tmp
% cp ../project.tgz ./
% datalad save -m "add archive" project.tgz
% git annex initremote archivist type=external externaltype=archivist encryption=none autoenable=true
% archivekey=$(git annex lookupkey project.tgz)
% datalad -f json ls-file-collection tarfile project.tgz --hash md5 | jq '. | select(.type == "file")' | jq --slurp . | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - "dl+archive:${archivekey}#path={item}&size={size}" '{item}'
% filekey=$(git annex lookupkey project/file1.txt)
% archivist_uuid=$(git annex info archivist | grep 'uuid' | cut -d ' ' -f 2)
% git annex setpresentkey $filekey $archivist_uuid 1
% datalad get project/file1.txt
get(error): project/file1.txt (file) [Could not obtain 'MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.txt' -caused by- NotImplementedError]
% datalad configuration --scope local set datalad.archivist.legacy-mode=yes
set_configuration(ok): . [datalad.archivist.legacy-mode=yes]
% datalad get project/file1.txt                                            
[INFO   ] datalad-archives special remote is using an extraction cache under /playground/loj/abcd/tmp3/.git/datalad/tmp/archives/8bc4249de3. Remove it with DataLad's 'clean' command to save disk space. 
get(ok): project/file1.txt (file) [from archivist...]
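
For readers less familiar with jq, the two jq stages in the addurls pipeline above (select records with type "file", then slurp them into one JSON array) can be sketched in Python. This is only an illustration: `select_files` is a hypothetical helper, and the sample records are made up, carrying just the fields the demo refers to (item, type, size, hash-md5).

```python
import json
from typing import Iterable, List


def select_files(json_lines: Iterable[str]) -> List[dict]:
    """Parse JSON lines and keep only records whose "type" is "file"."""
    records = (json.loads(line) for line in json_lines if line.strip())
    return [rec for rec in records if rec.get("type") == "file"]


# Made-up sample input; real ls-file-collection output may carry more fields.
sample = [
    '{"item": "project", "type": "directory"}',
    '{"item": "project/file1.txt", "type": "file", "size": 0, '
    '"hash-md5": "d41d8cd98f00b204e9800998ecf8427e"}',
]

files = select_files(sample)
# One JSON array, comparable to what `jq --slurp .` hands to addurls.
print(json.dumps(files, indent=2))
```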
datalad wtf
# WTF
## configuration <SENSITIVE, report disabled by configuration>
## credentials 
  - keyring: 
    - active_backends: 
      - PlaintextKeyring with no encyption v.1.0 at /home/loj/.local/share/python_keyring/keyring_pass.cfg
    - config_file: /home/loj/.config/python_keyring/keyringrc.cfg
    - data_root: /home/loj/.local/share/python_keyring
## datalad 
  - version: 0.19.3
## dependencies 
  - annexremote: 1.6.0
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 10.20221003
  - cmd:bundled-git: UNKNOWN
  - cmd:git: 2.39.2
  - cmd:ssh: 8.4p1
  - cmd:system-git: 2.39.2
  - cmd:system-ssh: 8.4p1
  - humanize: 4.8.0
  - iso8601: 2.1.0
  - keyring: 24.2.0
  - keyrings.alt: 5.0.0
  - msgpack: 1.0.7
  - platformdirs: 3.11.0
  - requests: 2.31.0
## environment 
  - LANG: en_US.UTF-8
  - LANGUAGE: en_US.UTF-8
  - LC_ALL: en_US.UTF-8
  - LC_CTYPE: en_US.UTF-8
  - PATH: /home/loj/.venvs/abcd-long/bin:/home/loj/.dotfiles/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin:/usr/local/games:/usr/games
## extensions 
  - container: 
    - description: Containerized environments
    - entrypoints: 
      - datalad_container.containers_add.ContainersAdd: 
        - class: ContainersAdd
        - module: datalad_container.containers_add
        - names: 
          - containers-add
          - containers_add
      - datalad_container.containers_list.ContainersList: 
        - class: ContainersList
        - module: datalad_container.containers_list
        - names: 
          - containers-list
          - containers_list
      - datalad_container.containers_remove.ContainersRemove: 
        - class: ContainersRemove
        - module: datalad_container.containers_remove
        - names: 
          - containers-remove
          - containers_remove
      - datalad_container.containers_run.ContainersRun: 
        - class: ContainersRun
        - module: datalad_container.containers_run
        - names: 
          - containers-run
          - containers_run
    - module: datalad_container
    - version: 1.2.3
  - next: 
    - description: What is next in DataLad
    - entrypoints: 
      - datalad_next.commands.create_sibling_webdav.CreateSiblingWebDAV: 
        - class: CreateSiblingWebDAV
        - module: datalad_next.commands.create_sibling_webdav
        - names: 
          - create-sibling-webdav
      - datalad_next.commands.credentials.Credentials: 
        - class: Credentials
        - module: datalad_next.commands.credentials
        - names: 
      - datalad_next.commands.download.Download: 
        - class: Download
        - module: datalad_next.commands.download
        - names: 
          - download
      - datalad_next.commands.ls_file_collection.LsFileCollection: 
        - class: LsFileCollection
        - module: datalad_next.commands.ls_file_collection
        - names: 
          - ls-file-collection
      - datalad_next.commands.tree.TreeCommand: 
        - class: TreeCommand
        - module: datalad_next.commands.tree
        - names: 
          - tree
    - module: datalad_next
    - version: 1.0.1
## git-annex 
  - build flags: 
    - Assistant
    - Webapp
    - Pairing
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Benchmark
    - Feeds
    - Testsuite
    - S3
    - WebDAV
  - dependency versions: 
    - aws-0.22
    - bloomfilter-2.0.1.0
    - cryptonite-0.26
    - DAV-1.3.4
    - feed-1.3.0.1
    - ghc-8.8.4
    - http-client-0.6.4.1
    - persistent-sqlite-2.10.6.2
    - torrent-10000.1.1
    - uuid-1.3.13
    - yesod-1.6.1.0
  - key/value backends: 
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
    - X*
  - operating system: linux x86_64
  - remote types: 
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - httpalso
    - borg
    - hook
    - external
  - supported repository versions: 
    - 8
    - 9
    - 10
  - upgrade supported from repository versions: 
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
    - 8
    - 9
    - 10
  - version: 10.20221003
## location 
  - path: /playground/loj/abcd
  - type: directory
## metadata.extractors 
  - container_inspect: 
    - distribution: datalad-container 1.2.3
    - load_error: ModuleNotFoundError(No module named 'datalad_metalad')
    - module: datalad_container.extractors.metalad_container
## metadata.filters 
## metadata.indexers 
## python 
  - implementation: CPython
  - version: 3.9.2
## system 
  - distribution: debian/11/bullseye
  - encoding: 
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - filesystem: 
    - CWD: 
      - path: /playground/loj/abcd
    - HOME: 
      - path: /home/loj
    - TMP: 
      - path: /tmp
  - max_path_length: 276
  - name: Linux
  - release: 5.10.0-23-amd64
  - type: posix
  - version: #1 SMP Debian 5.10.179-1 (2023-05-12)

@mih
Member

mih commented Oct 26, 2023

Thanks a lot for the excellent report, which made it easy to spot the issue. The problem is indeed that the .tgz extension is not used to detect the archive type. There are two things that can be done here.

Fix 1:

You can declare the archive type in the URL. The adjusted addurls call that does this is:

datalad -f json ls-file-collection tarfile project.tgz --hash md5 | jq '. | select(.type == "file")' | jq --slurp . | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - "dl+archive:${archivekey}#path={item}&size={size}&atype=tar" '{item}'

(Look for atype=tar at the end of the URL template.) The docs on this are at https://docs.datalad.org/projects/next/en/latest/generated/generated/datalad_next.types.archivist.html#syntax-of-dl-archives-locators
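
As a rough illustration of what the URL template above produces, here is a minimal Python sketch that assembles such a locator string. `build_locator` is a hypothetical helper, not part of datalad-next, and the archive key is a placeholder. The sketch does no escaping of the member path, so check the linked docs for the exact syntax rules.

```python
from typing import Optional


def build_locator(akey: str, path: str, size: Optional[int] = None,
                  atype: Optional[str] = None) -> str:
    """Assemble a dl+archive locator string (hypothetical helper)."""
    locator = f"dl+archive:{akey}#path={path}"
    if size is not None:
        locator += f"&size={size}"
    if atype is not None:
        # Declaring the type explicitly means nothing has to be inferred
        # from the extension of the archive key.
        locator += f"&atype={atype}"
    return locator


# Placeholder key; a real one comes from `git annex lookupkey project.tgz`.
akey = "MD5E-s123--0123456789abcdef0123456789abcdef.tgz"
print(build_locator(akey, "project/file1.txt", size=0, atype="tar"))
```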

Fix 2:

The following patch would make this unnecessary, and I think it is sensible to recognize .tgz as a TAR archive.

diff --git a/datalad_next/types/archivist.py b/datalad_next/types/archivist.py
index 12e9b2b..3c1ab49 100644
--- a/datalad_next/types/archivist.py
+++ b/datalad_next/types/archivist.py
@@ -134,6 +134,8 @@ class ArchivistLocator:
                 atype = ArchiveType.zip
             elif '.tar' in suf:
                 atype = ArchiveType.tar
+            elif '.tgz' in suf:
+                atype = ArchiveType.tar
 
         return cls(
             akey=akey,

I will propose a PR.
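
To make the effect of the patch concrete, here is a standalone sketch of suffix-based type detection. It assumes `suf` is a list of file name suffixes such as the one `pathlib.PurePosixPath(...).suffixes` returns; `guess_archive_type` is a hypothetical stand-in for the logic in `ArchivistLocator`, returning plain strings instead of `ArchiveType` members, and the key names are placeholders.

```python
from pathlib import PurePosixPath
from typing import Optional


def guess_archive_type(key_name: str) -> Optional[str]:
    """Guess the archive type from the suffixes of an annex key name."""
    suf = PurePosixPath(key_name).suffixes
    if ".zip" in suf:
        return "zip"
    elif ".tar" in suf:
        return "tar"
    elif ".tgz" in suf:
        # The branch added by the patch: treat .tgz like .tar.gz.
        return "tar"
    return None


for name in ("MD5E-s1--abc.tar.gz", "MD5E-s1--abc.tgz",
             "MD5E-s1--abc.zip", "MD5E-s1--abc.txt"):
    print(name, "->", guess_archive_type(name))
```

A key ending in .tar.gz yields the suffixes .tar and .gz and so already matched the existing .tar branch, while a key ending in .tgz yields only .tgz, which is why the type went undetected before the patch.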

mih added a commit to mih/datalad-next that referenced this issue Oct 26, 2023
This seems to be at least as sensible as `.tar`.

Closes datalad#517
mih closed this as completed in #518 Oct 26, 2023