-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for pin_write
with type="file"
#201
Conversation
67e6655
to
65a2be3
Compare
cc @machow if you wouldn't mind taking a look here! |
@cpcloud thanks--this looks really handy! This is more an issue with me not having completed the implementation of Here's how I tested against s3: # see .env.dev for configuration
from dotenv import load_dotenv
load_dotenv()
from pins.tests.helpers import BoardBuilder
bb = BoardBuilder("s3")
board = bb.create_tmp_board("pins/tests/pins-compat")
board.pin_write("abcd.x.y.z", "cool_pin", type="file")
board.pin_read("cool_pin") WDYT of reading type="file" returning the fsspec file object (which may point to the cache on disk)? Worst case scenario, the R pins library has functions called I'm away for the week so may be a bit slow to respond, but am happy to dig in when I get back! |
I ran the tests on a dev branch, since I can't remember how to get secrets into an external PR (or it's weirdly hard?),
https://github.com/rstudio/pins-python/actions/runs/5579519223/jobs/10195188387 I realize this process is nuts, since it hits a ton of buckets, and am definitely happy to do a lot of the work here if it helps! |
@machow Thanks! I tested this with gcs and even rolled it into an ibis PR so I don't think it's a general cloud boards issue. I'm a bit -1 on returning the result of My end goal is to be able get all the on disk paths for a given pin, since that is what we're passing to the various readers in ibis. I think I may be misunderstanding what kind of thing |
When you construct a board, it creates the relevant fsspec filesystem (eg s3, gcs, local), so I think
Fair 😅! I think I've come around to just doing what this PR does, since.. cc @juliasilge wdyt? For context..
https://pins.rstudio.com/articles/pins.html#reading-and-writing-files (I have to set this down but will pick up Monday when I'm back!) |
I am also something of a -1 on
On the R side, this type of task is handled by Is it doable to implement |
This article might be a relevant thing to take a look at, for how we recommend folks handle files they want to read/write in a custom way: |
Hm, I was hoping my use case isn't that custom: I don't want a DataFrame out of |
Oh for sure, maybe I just made things more confusing by showing what folks can use
My opinion is yes, for consistency across our language implementations of pins. Neither is implemented yet though for Python: Line 347 in 7ded069
What do you all think of these methods returning the fsspec file directly, rather than the path, like @machow mentioned? |
For my use case it would be less than ideal to get back an fsspec path, unless I can get a filename from that to the cached path on disk. I'd like pins to be able to handle the downloading, caching, and checking for new data without having to deal with whatever cloud file system thing that ultimately backs the pin. |
Sorry, to be clear what I mean is my ideal signature is this:
|
This is extremely consistent with the R |
I think this all makes sense! Just to be sure to recap:
@cpcloud do you mind if I pick up the PR from here? (should be able to make the changes tomorrow). Since |
Sgtm, please take it over! |
Just wanted to chime in and say the proposed API would fit our needs perfectly as well, so thanks for working on this 🙏 |
Alright, tests are passing with basic Currently,
(Will also add pin_download/upload to docs) |
pins/boards.py
Outdated
if type == "file": | ||
# the file type makes the name of the data the exact filename, rather | ||
# than the pin name + a suffix (e.g. my_pin.csv). | ||
object_name = Path(x).with_suffix("").name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may need to do something like
p = Path(x)
p = p.name[:-len("".join(p.suffixes))]
to handle paths like file.csv.gz
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks! I tweaked a tiny bit for the case where a file has no suffixes, and added tests for the no suffix, multiple suffix cases!
@isabelizimm when you get the chance, do you mind taking a look? I think the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! A few small questions. @machow are you planning to add docs in this PR? I'm also happy to update docstrings and add to the docs (either in this PR or a subsequent one).
assert df.x.tolist() == [1, 2, 3] | ||
|
||
with pytest.raises(NotImplementedError): | ||
board_with_cache.pin_read("cool_pin") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we point people to pin_download
when they try to read a file in this error message? Right now it is just NotImplementedError: No driver for type file
.
x, name, type, title, description, metadata, versioned, created | ||
) | ||
|
||
def pin_download(self, name, version=None, hash=None) -> Sequence[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly a question for my understanding: is the outcome of pin_download
to help find the path to download the pin, rather than actually downloading the pin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It downloads the pin, and returns the path to the downloaded pin (so I think the second thing?!).
There's a sort of important architecture piece here. In R pins, content always has to be downloaded first and put onto disk somewhere. Python pins uses a file object, which hides the implementation behavior of how the content is retrieved. e.g. some_file.read()
can save to an on-disk cache if it wants to--and that's the default behavior--but it's not necessary. However, because pin_download()
requires returning a path to a file on disk, python pins can only do that in cases where there is an on-disk cache.
That's why we need this lil cache check:
# could also check whether f isinstance of PinCache
fname = getattr(f, "name", None)
if fname is None:
raise PinsError("pin_download requires a cache.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes! Thanks for the explanation. That all makes sense--I suppose I was looking for a user-defined path where things got downloaded to, rather than getting downloaded to a set location (cache).
@isabelizimm thanks for looking through the PR! I can chuck the methods into the docs, once #207 is merged! |
@machow @isabelizimm Thanks for pushing this through ❤️ |
Thanks again for working on this! Is there any ETA on when a release with this change will be published? I know from experience it is annoying when people ask this 😄, but we have a use case it would be perfect for it and I am trying to figure out if I should wait for a release or pull these changes in sooner manually. |
Great question @jimhester! We're planning on a release by the end of August; it's on the to-do list for @machow and I as I transition into the maintainer role. 😄 |
Following up here-- @jimhester pins 0.8.2 has been released! |
This PR adds support for
pin_write
withtype="file"
.