Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of str::to_lowercase and str::to_uppercase #36127

Closed
wants to merge 3 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
75 changes: 60 additions & 15 deletions src/libcollections/str.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1634,30 +1634,53 @@ impl str {
#[stable(feature = "unicode_case_mapping", since = "1.2.0")]
pub fn to_lowercase(&self) -> String {
let mut s = String::with_capacity(self.len());
for (i, c) in self[..].char_indices() {
if c == 'Σ' {
// Σ maps to σ, except at the end of a word where it maps to ς.
// This is the only conditional (contextual) but language-independent mapping
// in `SpecialCasing.txt`,
// so hard-code it rather than have a generic "condition" mechanism.
// See https://github.com/rust-lang/rust/issues/26035
map_uppercase_sigma(self, i, &mut s)
} else {
s.extend(c.to_lowercase());
let mut left = None;

// Try to collect slices of lower case characters to push into the
// result or extend with the lower case version if an upper case
// character is found.
for (i, ch) in self.char_indices() {
if !ch.is_lowercase() && ch.is_alphabetic() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this perhaps be ch != ch.to_lowercase()?

I'm wary about saying that !ch.is_lowercase() is equivalent to the logic above...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ch.to_lowercase would return an Iterator, so I'd have to create it, make it peekable, check the first character is different and then use it.

The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.

If someone with Unicode knowledge can chime in it would be nice, I'm wary about it too.

From what I understand non-alphabetic characters can't be turned to lower case, and the logic is that some characters (title case and non-alphabetic ones) are neither upper case nor lower case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @SimonSapin, tweaks in the behavior of to_lowercase here.

The optimization here is to collect up chunks of a string that are entirely lowercase and then push it all onto the string all at once (instead of one at a time). The logic is to instead test if each character is lowercase already and just maintain some indexes if that's the case.

We're worried though that this may not produce the same results?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand non-alphabetic characters can't be turned to lower case,

Unicode is full of surprising corner cases, so I wouldn’t rely on something like this on faith.

I’m opposed to skipping char::to_lowercase unless you make src/etc/unicode.py check exhaustively that it returns its input unchanged for ever code point where you would skip it, and that this stays the case when we updated to a new Unicode version.

The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.

Here are a few random ideas to improved this.

char::to_lowercase is implemented in src/librustc_unicode/tables.rs with a binary search, while char::is_lowercase and char::is_alphabetic use a trie of bits and have a fast-path for ASCII.

char::to_lowercase could use a similar BoolTrie to skip the binary search for code points that are unchanged (thought this would have a cost in binary size) and have its own ASCII fast path.

to_lowercase_table could be changed to store [u8; 9] (or whatever is the maximum length) in UTF-8 rather than [char; 3] in (effectively) UTF-32, so that pushing to a String doesn’t need to do an encoding conversion.

If doing larger copies from &self turns out to be significantly more efficient than pushing code points one (or up to three) at a time, the ToLowercase iterator could be extended to implement PartialEq<char> or equivalent, to find out if the conversion was a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be possible to replace the binary searches entirely, with tries. I’ll look into that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, tries for case mapping is not as easy as I first thought: we’d need to play some more tricks to keep it somewhat space-efficient, such as encoding ranges of code points that map to a single code point at a constant offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding PartialEq<char> to ToLowercase and ToUppercase would already allow the optimization to work.

From what I tested the performance improvement comes from using push_str instead of extend, because the push_str gets optimized to a memcpy, so the bigger the already properly cased substring is the faster it gets.

I can look into adding the PartialEq<char> impl and using it here if you want.

if let Some(offset) = left.take() {
s.push_str(&self[offset..i]);
}

if ch == 'Σ' {
// Σ maps to σ, except at the end of a word where it maps to ς.
// This is the only conditional (contextual) but language-independent mapping
// in `SpecialCasing.txt`,
// so hard-code it rather than have a generic "condition" mechanism.
// See https://github.com/rust-lang/rust/issues/26035

map_uppercase_sigma(self, i, &mut s);
}
else {
s.extend(ch.to_lowercase());
}
}
else if left.is_none() {
left = Some(i);
}
}

// Append any leftover upper case characters.
if let Some(offset) = left.take() {
s.push_str(&self[offset..]);
}

return s;

fn map_uppercase_sigma(from: &str, i: usize, to: &mut String) {
// See http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G33992
// for the definition of `Final_Sigma`.
debug_assert!('Σ'.len_utf8() == 2);
let is_word_final = case_ignoreable_then_cased(from[..i].chars().rev()) &&
!case_ignoreable_then_cased(from[i + 2..].chars());
to.push_str(if is_word_final {
"ς"
!case_ignoreable_then_cased(from[i + 2..].chars());

to.push(if is_word_final {
'ς'
} else {
"σ"
'σ'
});
}

Expand Down Expand Up @@ -1697,7 +1720,29 @@ impl str {
#[stable(feature = "unicode_case_mapping", since = "1.2.0")]
pub fn to_uppercase(&self) -> String {
let mut s = String::with_capacity(self.len());
s.extend(self.chars().flat_map(|c| c.to_uppercase()));
let mut left = None;

// Try to collect slices of upper case characters to push into the
// result or extend with the upper case version if a lower case
// character is found.
for (i, ch) in self.char_indices() {
if !ch.is_uppercase() && ch.is_alphabetic() {
if let Some(offset) = left.take() {
s.push_str(&self[offset..i]);
}

s.extend(ch.to_uppercase());
}
else if left.is_none() {
left = Some(i);
}
}

// Append any leftover upper case characters.
if let Some(offset) = left.take() {
s.push_str(&self[offset..]);
}

return s;
}

Expand Down