Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for pin_write with type="file" #201

Merged
merged 24 commits into from
Aug 7, 2023

Conversation

cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Jul 15, 2023

This PR adds support for pin_write with type="file".

@cpcloud
Copy link
Contributor Author

cpcloud commented Jul 17, 2023

cc @machow if you wouldn't mind taking a look here!

@machow
Copy link
Collaborator

machow commented Jul 17, 2023

@cpcloud thanks--this looks really handy!

This is more an issue with me not having completed the implementation of type="file" in pins, but it looks like it might not work with cloud boards. I wonder if it might be useful for pin_read() with type="file" to return the fsspec file directly, rather than the path? Basically pins wraps the fsspec file object in a fsspec cache class.

Here's how I tested against s3:

# see .env.dev for configuration
from dotenv import load_dotenv
load_dotenv()

from pins.tests.helpers import BoardBuilder
bb = BoardBuilder("s3")

board = bb.create_tmp_board("pins/tests/pins-compat")

board.pin_write("abcd.x.y.z", "cool_pin", type="file")
board.pin_read("cool_pin")

WDYT of reading type="file" returning the fsspec file object (which may point to the cache on disk)? Worst case scenario, the R pins library has functions called pin_upload() and pin_download() that implement exactly this behavior (download file and return its path on disk; #32)

I'm away for the week so may be a bit slow to respond, but am happy to dig in when I get back!

@machow
Copy link
Collaborator

machow commented Jul 17, 2023

I ran the tests on a dev branch, since I can't remember how to get secrets into an external PR (or it's weirdly hard?),

        elif meta.type == "file":
            # TODO: update to handle multiple files
>           return [str(Path(fs.open(path_to_file).name).absolute())]
E           AttributeError: 'AzureBlobFile' object has no attribute 'name'

https://github.com/rstudio/pins-python/actions/runs/5579519223/jobs/10195188387

I realize this process is nuts, since it hits a ton of buckets, and am definitely happy to do a lot of the work here if it helps!

@cpcloud
Copy link
Contributor Author

cpcloud commented Jul 17, 2023

@machow Thanks!

I tested this with gcs and even rolled it into an ibis PR so I don't think it's a general cloud boards issue.

I'm a bit -1 on returning the result of open since that can leak resources without a with statement, which an end user could do, but then pin_read would some times be a context manager and sometimes not which feels a bit awkward IMO.

My end goal is to be able get all the on disk paths for a given pin, since that is what we're passing to the various readers in ibis.

I think I may be misunderstanding what kind of thing path_to_file is. Is it a str that refers to path on disk, or can it be an object in cloud storage, or can it either?

@machow
Copy link
Collaborator

machow commented Jul 18, 2023

When you construct a board, it creates the relevant fsspec filesystem (eg s3, gcs, local), so I think path_to_file is read using that filesystem (fs in the load function). I can't remember if the protocol gets included in the path, or if the filesystem itself determines where path is read (eg s3).

I'm a bit -1 on returning the result of open since that can leak resources

Fair 😅! I think I've come around to just doing what this PR does, since..

cc @juliasilge wdyt? For context..

  • R pins does this via pin_upload/download, but doesn't define any behavior for pin_read here
  • Since it's undefined behavior, returning the path here seems okay
  • We can always add a new function later for returning the underlying fsspec files for any type (eg if someone wants to query their parquet file directly on s3)

https://pins.rstudio.com/articles/pins.html#reading-and-writing-files

(I have to set this down but will pick up Monday when I'm back!)

@juliasilge
Copy link
Member

I am also something of a -1 on pin_read returning a path or fsspec object in the pins context, just because of how hard it is to reason about what a function or method returns when the return type depends on an argument. Like @cpcloud this strikes me as not great:

then pin_read would some times be a context manager and sometimes not

On the R side, this type of task is handled by pin_upload() / pin_download(); notice that pin_download() returns a vector of paths instead of the contents of the pin like pin_read().

Is it doable to implement pin_upload / pin_download methods, perhaps returning the fsspec file object, instead of adding type = "file"?

@juliasilge
Copy link
Member

This article might be a relevant thing to take a look at, for how we recommend folks handle files they want to read/write in a custom way:
https://pins.rstudio.com/articles/managing-custom-formats.html

@cpcloud
Copy link
Contributor Author

cpcloud commented Jul 18, 2023

Hm, I was hoping my use case isn't that custom: I don't want a DataFrame out of pin_x, I want the set of cached on-disk paths that make up the dataset. Should I be using pin_download for that?

@juliasilge
Copy link
Member

Hm, I was hoping my use case isn't that custom

Oh for sure, maybe I just made things more confusing by showing what folks can use pin_upload / pin_download for.

I want the set of cached on-disk paths that make up the dataset. Should I be using pin_download for that?

My opinion is yes, for consistency across our language implementations of pins. Neither is implemented yet though for Python:

raise NotImplementedError()

What do you all think of these methods returning the fsspec file directly, rather than the path, like @machow mentioned?

@cpcloud
Copy link
Contributor Author

cpcloud commented Jul 18, 2023

For my use case it would be less than ideal to get back an fsspec path, unless I can get a filename from that to the cached path on disk.

I'd like pins to be able to handle the downloading, caching, and checking for new data without having to deal with whatever cloud file system thing that ultimately backs the pin.

@cpcloud
Copy link
Contributor Author

cpcloud commented Jul 18, 2023

Sorry, to be clear what I mean is my ideal signature is this:

def pin_download(...) -> list[Path]: # or list[str]

@juliasilge
Copy link
Member

def pin_download(...) -> list[Path]: # or list[str]

This is extremely consistent with the R pin_download so to me, this sounds like the best way to go. 👍

@machow
Copy link
Collaborator

machow commented Jul 24, 2023

I think this all makes sense! Just to be sure to recap:

  • @cpcloud's implementation currently has pin_read() return the paths.
  • Let's use this exact behavior, but change it from using pin_read() to using pin_download() / pin_upload().

@cpcloud do you mind if I pick up the PR from here? (should be able to make the changes tomorrow). Since pin_download() and pin_upload() aren't implemented, I'll need to make some quick tweaks, but shouldn't be too bad!

@cpcloud
Copy link
Contributor Author

cpcloud commented Jul 24, 2023

Sgtm, please take it over!

@jimhester
Copy link

Just wanted to chime in and say the proposed API would fit our needs perfectly as well, so thanks for working on this 🙏

@machow
Copy link
Collaborator

machow commented Jul 27, 2023

Alright, tests are passing with basic pin_upload() / pin_download() behavior.

Currently,

  • pin_upload() just calls pin_write(), which always writes the file name as {pin_name}.{suffixes}
    • E.g. uploading the file "mtcars.csv" to the pin "my_cool_pin" names the file "my_cool_pin.csv" (but it should preserve mtcars.csv)
    • Need to pull out most of pin_write() into a separate method, so we can preserve the exact file name
  • pin_download() checks to ensure there's a cache to download to, errors otherwise.
  • pin_read() fails as expected for type="file"

(Will also add pin_download/upload to docs)

pins/boards.py Outdated
if type == "file":
# the file type makes the name of the data the exact filename, rather
# than the pin name + a suffix (e.g. my_pin.csv).
object_name = Path(x).with_suffix("").name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may need to do something like

p = Path(x)
p = p.name[:-len("".join(p.suffixes))]

to handle paths like file.csv.gz.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks! I tweaked a tiny bit for the case where a file has no suffixes, and added tests for the no suffix, multiple suffix cases!

@machow machow requested a review from isabelizimm July 28, 2023 20:21
@machow
Copy link
Collaborator

machow commented Jul 28, 2023

@isabelizimm when you get the chance, do you mind taking a look? I think the pin_download() / pin_upload() behavior should be implemented! (Though we still need to document them)

Copy link
Collaborator

@isabelizimm isabelizimm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! A few small questions. @machow are you planning to add docs in this PR? I'm also happy to update docstrings and add to the docs (either in this PR or a subsequent one).

assert df.x.tolist() == [1, 2, 3]

with pytest.raises(NotImplementedError):
board_with_cache.pin_read("cool_pin")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we point people to pin_download when they try to read a file in this error message? Right now it is just NotImplementedError: No driver for type file.

x, name, type, title, description, metadata, versioned, created
)

def pin_download(self, name, version=None, hash=None) -> Sequence[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly a question for my understanding: is the outcome of pin_download to help find the path to download the pin, rather than actually downloading the pin?

Copy link
Collaborator

@machow machow Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It downloads the pin, and returns the path to the downloaded pin (so I think the second thing?!).

There's a sort of important architecture piece here. In R pins, content always has to be downloaded first and put onto disk somewhere. Python pins uses a file object, which hides the implementation behavior of how the content is retrieved. e.g. some_file.read() can save to an on-disk cache if it wants to--and that's the default behavior--but it's not necessary. However, because pin_download() requires returning a path to a file on disk, python pins can only do that in cases where there is an on-disk cache.

That's why we need this lil cache check:

        # could also check whether f isinstance of PinCache
        fname = getattr(f, "name", None)

        if fname is None:
            raise PinsError("pin_download requires a cache.")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes! Thanks for the explanation. That all makes sense--I suppose I was looking for a user-defined path where things got downloaded to, rather than getting downloaded to a set location (cache).

@machow
Copy link
Collaborator

machow commented Aug 7, 2023

@isabelizimm thanks for looking through the PR! I can chuck the methods into the docs, once #207 is merged!

@machow machow merged commit a4aecfc into rstudio:main Aug 7, 2023
@cpcloud cpcloud deleted the support-pin-write-with-files branch August 7, 2023 18:16
@cpcloud
Copy link
Contributor Author

cpcloud commented Aug 7, 2023

@machow @isabelizimm Thanks for pushing this through ❤️

@jimhester
Copy link

Thanks again for working on this!

Is there any ETA on when a release with this change will be published?

I know from experience it is annoying when people ask this 😄, but we have a use case it would be perfect for it and I am trying to figure out if I should wait for a release or pull these changes in sooner manually.

@isabelizimm
Copy link
Collaborator

Great question @jimhester! We're planning on a release by the end of August; it's on the to-do list for @machow and I as I transition into the maintainer role. 😄

@isabelizimm
Copy link
Collaborator

Following up here-- @jimhester pins 0.8.2 has been released!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants