
Expose os.scandir through Trio's async filesystem API? #924

Open
njsmith opened this issue Feb 12, 2019 · 5 comments

Comments


njsmith commented Feb 12, 2019

Copying over a feature request from user "Manganus" on Discourse:


Since I, personally, hardly know how to write readable code, I regret that I won’t become a contributor to this project.

Anyway, I would love to learn how to wrap os.scandir, until someone (who knows how to write readable code) has incorporated it into trio.

The way trio.Path.iterdir works would IMHO fit perfectly (assuming I understand it right): it seems to be an async method that calls os.listdir in a worker thread and then returns a regular generator (that does the actual yielding of results).

The main point is the DirEntry objects. 😉

I guess the main problem is the caching of DirEntry.stat(). If the data is in cache, you don’t really wish to await the result. But if not, you wouldn’t wish to block the execution.


njsmith commented Feb 12, 2019

So there are a few things to figure out here, I think:

  • Right now, our strategy for exposing async filesystem operations is "blindly copy whatever pathlib.Path does". This works well as a way to get a reasonably-comprehensive and reasonably-familiar API for users. But it also puts us at the mercy of the pathlib maintainers :-). I'd hesitate to add a scandir wrapper without seeing how they do it, because if we do it one way and then later they do it a different way, we'll have a problem.

  • The API looks pretty awkward to wrap with our usual strategy of putting a thread around every operation that might touch the disk. os.scandir inherently returns an iterator, and each call to __next__ might touch the disk. But going in and out of a thread on every call to __next__ probably adds tons of overhead. I guess we could batch it somehow – enter thread, call __next__ on the underlying iterator a bunch of times and save the results into a buffer, leave the thread, dole out the values one at a time? And then like Manganus points out, the returned DirEntry objects have multiple methods that might-or-might-not touch the disk. And in fact the whole point is that accessing them is fast in the cases where they can avoid touching the disk. But if we have to go in-and-out of a thread to call these methods, then they'll always be slow :-(.

@Manganus

> os.scandir inherently returns an iterator, and each call to __next__ might touch the disk. But going in and out of a thread on every call to __next__ probably adds tons of overhead. I guess we could batch it somehow – enter thread, call __next__ on the underlying iterator a bunch of times and save the results into a buffer, leave the thread, dole out the values one at a time?

If you ask me, ;) I would say that the iterator-thing has to be sacrificed. I would exhaust the os.scandir iterator at once, and then there would be no need to call os.scandir.close() later.

> But it also puts us at the mercy of the pathlib maintainers :-). I'd hesitate to add a scandir wrapper without seeing how they do it, because if we do it one way and then later they do it a different way, we'll have a problem.

Agree!
But now, let's pretend for a while that we already had a trio.scandir in existence. You are surely right! Probably one would not wish to switch threads to get cached results from trio.DirEntry. And probably not even switch tasks.

It's not trivial for those who use trio.scandir to choose between async functions and synchronous methods. So, in practice, the user of trio.scandir() can't be expected to know whether a result is already in cache or not. The difference between methods and functions may be a hint: hopefully, the methods rely on working memory only. But there are only two of them. With the functions, on the other hand, it would differ between platforms.

A trio.DirEntry would have doubled functions.

  • One asynchronous version that propagates the usual OSErrors, and
  • one synchronous version that throws an exception if there is a need to go async, with its overhead at each call

In my inner vision I see a trio.DirEntry object that has the same methods and functions as os.DirEntry, but with every async function coupled to a synchronous sibling method with almost the same name, that can be run without any task switching (and I agree that this would put us at the mercy of the os.scandir maintainers).

for d in await trio.scandir(pathlike):  # pseudocode sketch
    if ext(d.name) in ('mp3', 'flac'):
        if Condition:
            with await trio.Path.open(d) as fd:
                SomeFunction(fd)
        elif SomeOtherFunction(d.path) != mtime(await d.stat()):
            await YetAnotherFunction(d.path, d.c_inode, d.c_stat.st_dev)
    else:
        try:
            if not d.c_is_symlink:
                if d.c_is_dir:
                    subdirs += d
        except NotInCacheError:  # some platforms are lesser knowing
            if d.is_dir(follow_symlinks=False):
                subdirs += d


njsmith commented Feb 13, 2019

> If you ask me, ;) I would say that the iterator-thing has to be sacrificed. I would exhaust the os.scandir iterator at once, and then there would be no need to call os.scandir.close() later.

Note that if you want to do this right now, it's pretty simple:

scandir_results = await trio.run_sync_in_worker_thread(lambda: list(os.scandir("/some/path")))

> A trio.DirEntry would have doubled functions.
>
>   • One asynchronous version that propagates the usual OSErrors, and
>   • one synchronous version that throws an exception if there is a need to go async, with its overhead at each call

We have some precedent for this split – see trio.Lock.acquire vs trio.Lock.acquire_nowait.

The larger problem is that os.DirEntry doesn't provide any mechanism to figure out whether some data is cached. So the only way we could implement this in practice would be to reimplement all the scandir machinery ourselves, by making our own wrappers of readdir and FindFirstFile/FindNextFile. That's a lot of implementation complexity to support a fairly niche API... and then we'd have similar issues with os.walk and friends. I'm not sure it's worth it, versus e.g. improving our threading APIs (#810) and recommending that people doing this kind of filesystem traversal shove their whole loop into a thread.

@Manganus Can you say more about how you want to use os.scandir and Trio together?

@Manganus

> The larger problem is that os.DirEntry doesn't provide any mechanism to figure out whether some data is cached. So the only way we could implement this in practice would be to reimplement all the scandir machinery ourselves, by making our own wrappers of readdir and FindFirstFile/FindNextFile.

My instincts point in this direction:

  • lstat at the same time as calling _scandisk.c, to learn if the FS in question caches d_type
  • exhaust os.scandir and create a sequence (or set) of trio.DirEntry objects containing each os.DirEntry object with three bool flags in the trio.DirEntry object
    -- d_type in cache (calculated from that lstat, would start with True in many/most cases)
    -- inode in cache (defaults to False on Windows, True on posix)
    -- stat in cache (defaults to False)
  • update these flags from the object's functions
  • and use the flags to decide (instantly) whether the async functions can shortcut to the synchronous methods, and whether the synchronous methods throw the exception (NotInCacheError)

Then one actually has a choice: trio.DirEntry might expose those three flags for the interested user.

But scandir is far from a matter of life or death for me. And I have not yet made any attempt along the lines here. And I hope to stay on a slightly higher level than readdir and FindFirstFile/FindNextFile. Definitely!

class MyDirEntry:  # pseudocode!

    stat_cached: bool
    dent_cached: bool
    inod_cached: bool
    os_DirEntry: object  # the wrapped os.DirEntry

    def cached_inode(self):
        if not self.inod_cached:
            raise NotInCacheError
        return self.os_DirEntry.inode()  # synchronously

    async def inode(self):
        if self.inod_cached:
            return self.cached_inode()
        retval = await in_worker_thread(self.os_DirEntry.inode)
        self.inod_cached = True
        return retval

    def cached_is_symlink(self):
        if not self.dent_cached:
            raise NotInCacheError
        return self.os_DirEntry.is_symlink()

    async def is_symlink(self):
        if self.dent_cached:
            return self.cached_is_symlink()
        retval = await in_worker_thread(self.os_DirEntry.is_symlink)
        self.dent_cached = True
        return retval

    def cached_is_dir(self):  # and similarly for is_file(self)
        if not self.dent_cached:
            raise NotInCacheError
        if self.cached_is_symlink():
            handle this case
        else:
            return self.os_DirEntry.is_dir()

    async def is_dir(self):
        if self.dent_cached:
            is_symlink = self.cached_is_symlink()
        else:
            is_symlink = await self.is_symlink()
        if is_symlink:
            handle this case
        else:
            return self.os_DirEntry.is_dir()

    def cached_stat(self):
        if not self.stat_cached:
            raise NotInCacheError
        return self.os_DirEntry.stat()  # synchronously

    async def stat(self):
        if self.stat_cached:
            return self.cached_stat()
        retval = await in_worker_thread(self.os_DirEntry.stat)
        self.stat_cached = True
        self.dent_cached = True
        self.inod_cached = True
        return retval

> @Manganus Can you say more about how you want to use os.scandir and Trio together?

I would avoid premature optimization and initially use await d.is_dir() liberally. When bottlenecks are identified, it's time to replace them with the synchronous d.cached_is_dir(). And probably I would learn by trial and error where in the code one may get NotInCacheError.

And maybe I would use the flags:

if d.inod_cached:
    act_inode = d.cached_inode()
else:
    act_inode = await d.inode()
if d.stat_cached:
    act_mtime = mtime(d.cached_stat())
else:
    act_mtime = mtime(await d.stat())

@Manganus

> scandir_results = await trio.run_sync_in_worker_thread(lambda: list(os.scandir("/some/path")))

Thanks! This seems to work:

for entry in await trio.run_sync_in_worker_thread(my_scandir, actdir):

where my_scandir does little more than create a collection of os.DirEntry objects.

@Manganus Can you say more about what how you want to use os.scandir and Trio together?

It is more efficient to filter results from scandir than from listdir – not only in terms of time, but there are fewer opportunities to create bugs if I can use its cached results instead of building my own caches.
(...and you get the inode and is_dir/is_file without extra waiting time on most *nix systems).
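The difference shows up in a small sketch: with listdir, every type check is an extra stat() syscall, while scandir's DirEntry can often answer from the directory entry itself (both helper names are invented for illustration):

```python
import os


def subdirs_listdir(path):
    # listdir gives bare names: every is-it-a-directory check costs an
    # extra stat() syscall via os.path.isdir.
    return [name for name in os.listdir(path)
            if os.path.isdir(os.path.join(path, name))]


def subdirs_scandir(path):
    # scandir's DirEntry objects can usually answer is_dir() from the
    # readdir d_type field, with no extra syscall on most *nix systems.
    with os.scandir(path) as it:
        return [entry.name for entry in it if entry.is_dir()]
```

Both follow symlinks here (os.path.isdir does, and entry.is_dir() defaults to follow_symlinks=True), so the two return the same names.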
