Improve performance of str::to_lowercase and str::to_uppercase #36127
Closed
Could this perhaps be `ch != ch.to_lowercase()`? I'm wary about saying that `!ch.is_lowercase()` is equivalent to the logic above...
`ch.to_lowercase` would return an `Iterator`, so I'd have to create it, make it peekable, check that the first character is different, and then use it. The issue is that this would defeat the purpose of the optimization, because the conversion is what takes most of the time.
If someone with Unicode knowledge could chime in, that would be nice; I'm wary about it too.
From what I understand, non-alphabetic characters can't be turned to lower case, and the logic is that some characters (title case and non-alphabetic ones) are neither upper case nor lower case.
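To make the concern concrete, here is a minimal sketch (not code from this PR) that reports, for a single `char`, whether it is lowercase, whether it is uppercase, and whether `char::to_lowercase` actually changes it. The helper name `case_facts` is made up for illustration.

```rust
/// Hypothetical helper: returns (is_lowercase, is_uppercase, changed_by_to_lowercase).
fn case_facts(ch: char) -> (bool, bool, bool) {
    // `to_lowercase` yields an iterator of one or more chars; the mapping is a
    // no-op exactly when that iterator yields `ch` and nothing else.
    let changed = !Iterator::eq(ch.to_lowercase(), std::iter::once(ch));
    (ch.is_lowercase(), ch.is_uppercase(), changed)
}
```

For a digit such as `'1'` this gives `(false, false, false)`, and for the title-case letter U+01C5 it gives `(false, false, true)`, so `!ch.is_lowercase()` and "the lowercase mapping changes `ch`" are not the same condition.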
cc @SimonSapin, tweaks in the behavior of `to_lowercase` here.
The optimization here is to collect up chunks of a string that are entirely lowercase and then push them onto the string all at once (instead of one character at a time). The logic is to instead test if each character is lowercase already and just maintain some indexes if that's the case.
We're worried though that this may not produce the same results?
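A rough sketch of the chunking idea described above, under stated assumptions: it is not the PR's code, the name `lowercase_chunked` is hypothetical, and it decides "unchanged" by comparing against the real `char::to_lowercase` result rather than by a property test, so it keeps the conversion cost the PR was trying to avoid. It also ignores the final-sigma special case that `str::to_lowercase` handles, so it can differ from the standard method on input containing `Σ`.

```rust
/// Hypothetical sketch: lowercase `s`, copying already-lowercase runs in one `push_str`.
fn lowercase_chunked(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    // Byte index where the current run of unchanged characters starts.
    let mut run_start = 0;
    for (i, ch) in s.char_indices() {
        let unchanged = Iterator::eq(ch.to_lowercase(), std::iter::once(ch));
        if !unchanged {
            // Flush the pending run as a single slice copy, then push the mapping.
            out.push_str(&s[run_start..i]);
            out.extend(ch.to_lowercase());
            run_start = i + ch.len_utf8();
        }
    }
    // Flush whatever unchanged tail remains.
    out.push_str(&s[run_start..]);
    out
}
```

The only state kept between characters is `run_start`, which matches the "just maintain some indexes" description.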
Unicode is full of surprising corner cases, so I wouldn’t rely on something like this on faith.
I’m opposed to skipping `char::to_lowercase` unless you make `src/etc/unicode.py` check exhaustively that it returns its input unchanged for every code point where you would skip it, and that this stays the case when we update to a new Unicode version.
Here are a few random ideas to improve this.
`char::to_lowercase` is implemented in `src/librustc_unicode/tables.rs` with a binary search, while `char::is_lowercase` and `char::is_alphabetic` use a trie of bits and have a fast path for ASCII.
`char::to_lowercase` could use a similar `BoolTrie` to skip the binary search for code points that are unchanged (though this would have a cost in binary size) and have its own ASCII fast path.
`to_lowercase_table` could be changed to store `[u8; 9]` (or whatever is the maximum length) in UTF-8 rather than `[char; 3]` in (effectively) UTF-32, so that pushing to a `String` doesn’t need to do an encoding conversion.
If doing larger copies from `&self` turns out to be significantly more efficient than pushing code points one (or up to three) at a time, the `ToLowercase` iterator could be extended to implement `PartialEq<char>` or equivalent, to find out if the conversion was a no-op.
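The exhaustive check asked for above could look roughly like the following. This is a sketch in Rust rather than in `src/etc/unicode.py`, and both function names are hypothetical. In particular, `would_skip` is only a guess at the kind of predicate under discussion, not the PR's actual condition.

```rust
/// Returns every `char` that `skip` would bypass even though
/// `char::to_lowercase` does not return it unchanged.
fn find_violations<F: Fn(char) -> bool>(skip: F) -> Vec<char> {
    (0..=0x10FFFFu32)
        // `char::from_u32` rejects surrogates, so only valid scalar values remain.
        .filter_map(char::from_u32)
        .filter(|&ch| skip(ch) && !Iterator::eq(ch.to_lowercase(), std::iter::once(ch)))
        .collect()
}

/// Hypothetical fast-path predicate (an assumption, not taken from the PR).
fn would_skip(ch: char) -> bool {
    ch.is_lowercase() || !ch.is_alphabetic()
}
```

Calling `find_violations(would_skip)` and requiring an empty result would have to be repeated on every Unicode version bump, since the answer depends on the tables the standard library ships with.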
It might also be possible to replace the binary searches entirely, with tries. I’ll look into that.
Ok, tries for case mapping are not as easy as I first thought: we’d need to play some more tricks to keep them somewhat space-efficient, such as encoding ranges of code points that map to a single code point at a constant offset.
I think adding `PartialEq<char>` to `ToLowercase` and `ToUppercase` would already allow the optimization to work.
From what I tested, the performance improvement comes from using `push_str` instead of `extend`, because the `push_str` gets optimized to a `memcpy`, so the bigger the already properly cased substring is, the faster it gets.
I can look into adding the `PartialEq<char>` impl and using it here if you want.
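As an illustration of what such a comparison might look like from outside the standard library, here is a sketch using a newtype, since user code cannot implement `PartialEq<char>` directly on `std::char::ToLowercase`. The names `Lowered` and `lowered` are hypothetical and not part of any real API.

```rust
use std::char::ToLowercase;
use std::iter;

/// Hypothetical wrapper standing in for a `PartialEq<char>` impl on `ToLowercase`.
struct Lowered(ToLowercase);

impl PartialEq<char> for Lowered {
    fn eq(&self, other: &char) -> bool {
        // Equal only if the mapping yields exactly one char and it is `other`.
        Iterator::eq(self.0.clone(), iter::once(*other))
    }
}

/// Hypothetical constructor wrapping `char::to_lowercase`.
fn lowered(ch: char) -> Lowered {
    Lowered(ch.to_lowercase())
}
```

With this, `lowered(ch) == ch` answers "was the conversion a no-op?" without consuming the iterator, so the caller can extend the current `push_str` run on `true` or push the mapped characters on `false`.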