-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of str::to_lowercase and str::to_uppercase #36127
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
Thanks for the PR! Can you also be sure to add a few test cases for some of the more tricky corner cases this needs to handle as well? |
Where should I put them? I'm still not sure where the tests for stuff are supposed to go. Do you have any specific examples you're concerned about? The only actual special case that was in the previous code is the final sigma one, which is basically copied as is. |
There exists tests in src/libcollectionstest/str.rs already, check if they cover the special cases already (I think they do). |
The failing test is confusing, but my Unicode knowledge doesn't go that far. If anyone could enlighten me, these characters |
From what I understand, unicode classifies
|
So in this test, which is currently failing:
It's expecting |
dfad156
to
33322c8
Compare
@frewsxcv makes sense, thanks a bunch. Updated benchmark results:
Code for them: #![feature(test, unicode)]
extern crate test;
use test::Bencher;
extern crate rustc_unicode;
fn main() {
println!("Hello, world!");
}
fn to_uppercase(this: &str) -> String {
let mut s = String::with_capacity(this.len());
let mut left = None;
// Try to collect slices of upper case characters to push into the
// result or extend with the upper case version if a lower case
// character is found.
for (i, ch) in this.char_indices() {
if !ch.is_uppercase() {
if let Some(offset) = left.take() {
s.push_str(&this[offset..i]);
}
s.extend(ch.to_uppercase());
}
else if left.is_none() {
left = Some(i);
}
}
// Append any leftover upper case characters.
if let Some(offset) = left.take() {
s.push_str(&this[offset..]);
}
s
}
fn to_lowercase(this: &str) -> String {
let mut s = String::with_capacity(this.len());
let mut left = None;
// Try to collect slices of lower case characters to push into the
// result or extend with the lower case version if an upper case
// character is found.
for (i, ch) in this.char_indices() {
if !ch.is_lowercase() {
if let Some(offset) = left.take() {
s.push_str(&this[offset..i]);
}
if ch == 'Σ' {
// Σ maps to σ, except at the end of a word where it maps to ς.
// This is the only conditional (contextual) but language-independent mapping
// in `SpecialCasing.txt`,
// so hard-code it rather than have a generic "condition" mechanism.
// See https://github.com/rust-lang/rust/issues/26035
map_uppercase_sigma(this, i, &mut s);
}
else {
s.extend(ch.to_lowercase());
}
}
else if left.is_none() {
left = Some(i);
}
}
// Append any leftover upper case characters.
if let Some(offset) = left.take() {
s.push_str(&this[offset..]);
}
return s;
fn map_uppercase_sigma(from: &str, i: usize, to: &mut String) {
// See http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G33992
// for the definition of `Final_Sigma`.
debug_assert!('Σ'.len_utf8() == 2);
let is_word_final = case_ignoreable_then_cased(from[..i].chars().rev()) &&
!case_ignoreable_then_cased(from[i + 2..].chars());
to.push(if is_word_final {
'ς'
} else {
'σ'
});
}
fn case_ignoreable_then_cased<I: Iterator<Item = char>>(iter: I) -> bool {
use rustc_unicode::derived_property::{Cased, Case_Ignorable};
match iter.skip_while(|&c| Case_Ignorable(c)).next() {
Some(c) => Cased(c),
None => false,
}
}
}
#[test]
fn std_to_lowercase() {
assert_eq!("".to_lowercase(), "");
assert_eq!("AÉDžaé ".to_lowercase(), "aédžaé ");
// https://github.com/rust-lang/rust/issues/26035
assert_eq!("ΑΣ".to_lowercase(), "ας");
assert_eq!("Α'Σ".to_lowercase(), "α'ς");
assert_eq!("Α''Σ".to_lowercase(), "α''ς");
assert_eq!("ΑΣ Α".to_lowercase(), "ας α");
assert_eq!("Α'Σ Α".to_lowercase(), "α'ς α");
assert_eq!("Α''Σ Α".to_lowercase(), "α''ς α");
assert_eq!("ΑΣ' Α".to_lowercase(), "ας' α");
assert_eq!("ΑΣ'' Α".to_lowercase(), "ας'' α");
assert_eq!("Α'Σ' Α".to_lowercase(), "α'ς' α");
assert_eq!("Α''Σ'' Α".to_lowercase(), "α''ς'' α");
assert_eq!("Α Σ".to_lowercase(), "α σ");
assert_eq!("Α 'Σ".to_lowercase(), "α 'σ");
assert_eq!("Α ''Σ".to_lowercase(), "α ''σ");
assert_eq!("Σ".to_lowercase(), "σ");
assert_eq!("'Σ".to_lowercase(), "'σ");
assert_eq!("''Σ".to_lowercase(), "''σ");
assert_eq!("ΑΣΑ".to_lowercase(), "ασα");
assert_eq!("ΑΣ'Α".to_lowercase(), "ασ'α");
assert_eq!("ΑΣ''Α".to_lowercase(), "ασ''α");
}
#[test]
fn std_to_uppercase() {
assert_eq!("".to_uppercase(), "");
assert_eq!("aéDžßfiᾀ".to_uppercase(), "AÉDŽSSFIἈΙ");
}
#[test]
fn new_to_lowercase() {
assert_eq!(to_lowercase(""), "");
assert_eq!(to_lowercase("AÉDžaé "), "aédžaé ");
// https://github.com/rust-lang/rust/issues/26035
assert_eq!(to_lowercase("ΑΣ"), "ας");
assert_eq!(to_lowercase("Α'Σ"), "α'ς");
assert_eq!(to_lowercase("Α''Σ"), "α''ς");
assert_eq!(to_lowercase("ΑΣ Α"), "ας α");
assert_eq!(to_lowercase("Α'Σ Α"), "α'ς α");
assert_eq!(to_lowercase("Α''Σ Α"), "α''ς α");
assert_eq!(to_lowercase("ΑΣ' Α"), "ας' α");
assert_eq!(to_lowercase("ΑΣ'' Α"), "ας'' α");
assert_eq!(to_lowercase("Α'Σ' Α"), "α'ς' α");
assert_eq!(to_lowercase("Α''Σ'' Α"), "α''ς'' α");
assert_eq!(to_lowercase("Α Σ"), "α σ");
assert_eq!(to_lowercase("Α 'Σ"), "α 'σ");
assert_eq!(to_lowercase("Α ''Σ"), "α ''σ");
assert_eq!(to_lowercase("Σ"), "σ");
assert_eq!(to_lowercase("'Σ"), "'σ");
assert_eq!(to_lowercase("''Σ"), "''σ");
assert_eq!(to_lowercase("ΑΣΑ"), "ασα");
assert_eq!(to_lowercase("ΑΣ'Α"), "ασ'α");
assert_eq!(to_lowercase("ΑΣ''Α"), "ασ''α");
}
#[test]
fn new_to_uppercase() {
assert_eq!(to_uppercase(""), "");
assert_eq!(to_uppercase("aéDžßfiᾀ"), "AÉDŽSSFIἈΙ");
}
#[bench]
fn upper_new_really_bad(b: &mut Bencher) {
b.iter(|| to_uppercase("aAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaA"));
}
#[bench]
fn upper_new_worst(b: &mut Bencher) {
b.iter(|| to_uppercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}
#[bench]
fn upper_new_mixed(b: &mut Bencher) {
b.iter(|| to_uppercase("AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa"));
}
#[bench]
fn upper_new_best(b: &mut Bencher) {
b.iter(|| to_uppercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}
#[bench]
fn upper_new_unicode(b: &mut Bencher) {
b.iter(|| to_uppercase("AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA"));
}
#[bench]
fn upper_new_unicode2(b: &mut Bencher) {
b.iter(|| to_uppercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}
#[bench]
fn upper_std_really_bad(b: &mut Bencher) {
b.iter(|| "aAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaA".to_uppercase());
}
#[bench]
fn upper_std_worst(b: &mut Bencher) {
b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_uppercase());
}
#[bench]
fn upper_std_mixed(b: &mut Bencher) {
b.iter(|| "AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa".to_uppercase());
}
#[bench]
fn upper_std_unicode(b: &mut Bencher) {
b.iter(|| "AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA".to_uppercase());
}
#[bench]
fn upper_std_unicode2(b: &mut Bencher) {
b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_uppercase());
}
#[bench]
fn upper_std_best(b: &mut Bencher) {
b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_uppercase());
}
#[bench]
fn lower_new_really_bad(b: &mut Bencher) {
b.iter(|| to_lowercase("AaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAa"));
}
#[bench]
fn lower_new_worst(b: &mut Bencher) {
b.iter(|| to_lowercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}
#[bench]
fn lower_new_mixed(b: &mut Bencher) {
b.iter(|| to_lowercase("aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA"));
}
#[bench]
fn lower_new_best(b: &mut Bencher) {
b.iter(|| to_lowercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}
#[bench]
fn lower_new_unicode(b: &mut Bencher) {
b.iter(|| to_lowercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}
#[bench]
fn lower_new_unicode2(b: &mut Bencher) {
b.iter(|| to_lowercase("AAAЪAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA"));
}
#[bench]
fn lower_std_really_bad(b: &mut Bencher) {
b.iter(|| "AaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAa".to_lowercase());
}
#[bench]
fn lower_std_worst(b: &mut Bencher) {
b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_lowercase());
}
#[bench]
fn lower_std_mixed(b: &mut Bencher) {
b.iter(|| "aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA".to_lowercase());
}
#[bench]
fn lower_std_unicode(b: &mut Bencher) {
b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_lowercase());
}
#[bench]
fn lower_std_unicode2(b: &mut Bencher) {
b.iter(|| "AAAЪAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA".to_lowercase());
}
#[bench]
fn lower_std_best(b: &mut Bencher) {
b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_lowercase());
} |
Looks like an unrelated failure on travis. |
|
// result or extend with the lower case version if an upper case | ||
// character is found. | ||
for (i, ch) in self.char_indices() { | ||
if !ch.is_lowercase() && ch.is_alphabetic() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this perhaps be ch != ch.to_lowercase()
?
I'm wary about saying that !ch.is_lowercase()
is equivalent to the logic above...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ch.to_lowercase
would return an Iterator
, so I'd have to create it, make it peekable, check the first character is different and then use it.
The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.
If someone with Unicode knowledge can chime in it would be nice, I'm wary about it too.
From what I understand non-alphabetic characters can't be turned to lower case, and the logic is that some characters (title case and non-alphabetic ones) are neither upper case nor lower case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @SimonSapin, tweaks in the behavior of to_lowercase
here.
The optimization here is to collect up chunks of a string that are entirely lowercase and then push it all onto the string all at once (instead of one at a time). The logic is to instead test if each character is lowercase already and just maintain some indexes if that's the case.
We're worried though that this may not produce the same results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand non-alphabetic characters can't be turned to lower case,
Unicode is full of surprising corner cases, so I wouldn’t rely on something like this on faith.
I’m opposed to skipping char::to_lowercase
unless you make src/etc/unicode.py
check exhaustively that it returns its input unchanged for ever code point where you would skip it, and that this stays the case when we updated to a new Unicode version.
The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.
Here are a few random ideas to improved this.
char::to_lowercase
is implemented in src/librustc_unicode/tables.rs
with a binary search, while char::is_lowercase
and char::is_alphabetic
use a trie of bits and have a fast-path for ASCII.
char::to_lowercase
could use a similar BoolTrie
to skip the binary search for code points that are unchanged (thought this would have a cost in binary size) and have its own ASCII fast path.
to_lowercase_table
could be changed to store [u8; 9]
(or whatever is the maximum length) in UTF-8 rather than [char; 3]
in (effectively) UTF-32, so that pushing to a String
doesn’t need to do an encoding conversion.
If doing larger copies from &self
turns out to be significantly more efficient than pushing code points one (or up to three) at a time, the ToLowercase
iterator could be extended to implement PartialEq<char>
or equivalent, to find out if the conversion was a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also be possible to replace the binary searches entirely, with tries. I’ll look into that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, tries for case mapping is not as easy as I first thought: we’d need to play some more tricks to keep it somewhat space-efficient, such as encoding ranges of code points that map to a single code point at a constant offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding PartialEq<char>
to ToLowercase
and ToUppercase
would already allow the optimization to work.
From what I tested the performance improvement comes from using push_str
instead of extend
, because the push_str
gets optimized to a memcpy
, so the bigger the already properly cased substring is the faster it gets.
I can look into adding the PartialEq<char>
impl and using it here if you want.
Closing due to inactivity, but feel free to resubmit with comments addressed! |
@alexcrichton I was actually waiting for @SimonSapin to reply 🐼 |
ah in that case, ping @SimonSapin, I believe about this comment |
@meh I’m not sure what response you’re expecting as I don’t see a question. I’ve said that I’m not confident that the patch as-is (using |
I was working on a crate to handle string casing and it seems the
to_lowercase
andto_uppercase
implementations I came up with (excluding the fact the stuff in my crate returnsCow<Self>
) is faster than the one in libstd in most cases, and a little slower in the worst case (i.e. all uppercase when callingto_lowercase
or the other way around).I ran the libstd tests and it was all green locally, but it doesn't look like there any test cases for these two functions, unless I'm blind 🐼
Below are the benchmarks (which probably aren't testing any real world scenario cases) I ran locally, and the code to run them.