tokei: eliminate reading files entirely into memory #148
Conversation
Cargo.lock is ignored by default with `cargo new`; it is generated from Cargo.toml. Tracking it was producing warnings in tests.
Thanks for your PR! From reading through the threads and seeing that this is also used in ripgrep, I think the decoder should be its own crate and used by tokei.
Agreed - I thought the same thing when I saw it. It would need a bit of refactoring to get it more generally usable, though. I'll ping @BurntSushi and potentially get his thoughts.
I did not know that!
Yes, the intent was always to put these transcoders in a separate crate, but I haven't had the time to do that. The transcoder in ripgrep is only correct in the way that it is being used there, not for all possible uses.
After reading hsivonen/encoding_rs#8 in full, there does not appear to be an easy way to refactor a UTF-8 transcoder into a generic crate. I would be more willing to refactor the already-modified transcoder for its own special use case to eliminate the second BufReader.
Closing due to conflicts. I think this would need to be a separate crate in some form before it could be merged, as I don't want to bring the maintenance of the module into the project. If that happens, or an alternative becomes available, feel free to make another PR!
ripgrep's transcoder has since been documented and polished into an external crate: https://github.com/BurntSushi/encoding_rs_io
This change drops peak memory consumption considerably (CPU timings are
unaffected).
The old behavior was to read files entirely into memory and then
transcode them to UTF-8. This was necessary because the encoding crate
doesn't support readers; it only operates on whole string references.
See the discussion here:
lifthrasiir/rust-encoding#91
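The old whole-file approach can be sketched roughly as follows. This is not the PR's actual code: `read_all_lossy` is a hypothetical helper, and `from_utf8_lossy` stands in for the encoding crate's decoding step. The point is that the entire file is resident in memory before any line counting happens.

```rust
use std::io::Read;

// Old approach (sketch): slurp the whole file into memory, then
// convert it to UTF-8 in one shot. Peak memory grows with file size.
fn read_all_lossy<R: Read>(mut reader: R) -> std::io::Result<String> {
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes)?; // entire file held in memory
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

fn main() {
    // Cursor over a byte slice stands in for a file.
    let input = std::io::Cursor::new(&b"hello\nworld\n"[..]);
    let text = read_all_lossy(input).unwrap();
    assert_eq!(text, "hello\nworld\n");
    println!("ok");
}
```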
We can't just wrap the new reader in the standard BufReader, though,
because BufReader's lines iterator returns a newly allocated String for
each line, which would have put us right back where we were.
Instead, we call read_line directly on the BufReader, reusing a single
buffer. We do not need to worry about stripping trailing newlines.
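The buffer-reuse pattern described above can be sketched as follows (a minimal illustration, not the PR's actual code; `count_lines` is a hypothetical helper): one String is allocated once and cleared between iterations, instead of a fresh allocation per line as with `lines()`.

```rust
use std::io::{BufRead, BufReader, Read};

// Count lines by reusing a single String buffer with read_line,
// avoiding the per-line allocation that BufReader::lines() incurs.
fn count_lines<R: Read>(reader: R) -> std::io::Result<usize> {
    let mut reader = BufReader::new(reader);
    let mut buf = String::new();
    let mut lines = 0;
    loop {
        buf.clear(); // reuse the allocation from the previous iteration
        if reader.read_line(&mut buf)? == 0 {
            break; // 0 bytes read means EOF
        }
        lines += 1;
    }
    Ok(lines)
}

fn main() {
    let data = std::io::Cursor::new("line one\nline two\nline three\n");
    assert_eq!(count_lines(data).unwrap(), 3);
    println!("ok");
}
```

Note that `read_line` leaves the trailing newline in the buffer, which is why the original text points out that stripping it is unnecessary for counting purposes.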
Benchmarked with https://gist.github.com/netj/526585, running the
installed tokei against the new tokei (after a few warm-up runs to
prime the page cache):
for i in `seq 1 10` ; do ~/memusg tokei >/dev/null ; ~/memusg ~/rust/tokei/target/release/tokei >/dev/null ; done
Results (each pair is the installed tokei followed by the new tokei):
memusg: peak=62468
memusg: peak=21108
memusg: peak=67736
memusg: peak=19952
memusg: peak=68984
memusg: peak=20960
memusg: peak=68208
memusg: peak=20964
memusg: peak=57968
memusg: peak=21492
memusg: peak=63468
memusg: peak=20484
memusg: peak=68512
memusg: peak=19740
memusg: peak=68600
memusg: peak=21984
memusg: peak=69860
memusg: peak=22016
memusg: peak=70932
memusg: peak=22656
As shown, every run of the new build has a much smaller (under 50%)
peak memory footprint.
I've run tokei over some arbitrary files on my own system before and after this diff, and the results are identical.
This solution isn't perfect: if a file is large and contains no newlines, the buffered line will still consume a lot of RAM. I may look into a solution that does not buffer the whole line into memory.
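One direction for the fix mentioned above, sketched under the assumption that only a newline count is needed at this stage (real tokei also has to classify lines, which this ignores): scan fixed-size chunks instead of accumulating a whole line, so memory stays constant even for a huge file with no newlines at all. `count_newlines` is a hypothetical helper, not code from this PR.

```rust
use std::io::Read;

// Count '\n' bytes using a fixed 8 KiB buffer. Peak memory is
// constant regardless of line length or file size.
fn count_newlines<R: Read>(mut reader: R) -> std::io::Result<usize> {
    let mut buf = [0u8; 8192];
    let mut count = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        count += buf[..n].iter().filter(|&&b| b == b'\n').count();
    }
    Ok(count)
}

fn main() {
    let data = std::io::Cursor::new("first line\nsecond line without newline");
    assert_eq!(count_newlines(data).unwrap(), 1);
    println!("ok");
}
```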