Improve performance of str::to_lowercase and str::to_uppercase #36127

meh · 2016-08-30T01:32:17Z

I was working on a crate to handle string casing, and the to_lowercase and to_uppercase implementations I came up with (setting aside that the versions in my crate return Cow<Self>) are faster than the ones in libstd in most cases, and a little slower in the worst case (i.e. all uppercase when calling to_lowercase, or the other way around).

I ran the libstd tests and they were all green locally, but it doesn't look like there are any test cases for these two functions, unless I'm blind 🐼

Below are the benchmarks I ran locally (which probably don't reflect any real-world scenario) and the code to run them.
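The crate's Cow-returning API is not shown in this thread, so the following is only a sketch of what such a variant could look like. The names `lowercase_is_noop` and `to_lowercase_cow` are hypothetical, and the owned path simply defers to std.

```rust
use std::borrow::Cow;

/// Hypothetical helper: true when lowercasing `c` yields exactly `c`.
fn lowercase_is_noop(c: char) -> bool {
    let mut mapped = c.to_lowercase();
    mapped.next() == Some(c) && mapped.next().is_none()
}

/// Hypothetical Cow-returning variant: borrow the input when no
/// character would change, allocate only otherwise.
fn to_lowercase_cow(s: &str) -> Cow<'_, str> {
    if s.chars().all(lowercase_is_noop) {
        Cow::Borrowed(s)
    } else {
        // Defer to std so contextual rules such as final sigma still apply.
        Cow::Owned(s.to_lowercase())
    }
}
```

Calling `to_lowercase_cow("already lower")` borrows the input, while `to_lowercase_cow("MiXed")` allocates a new `String`.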

test lower_new_best     ... bench:         208 ns/iter (+/- 14)
test lower_new_mixed    ... bench:       1,216 ns/iter (+/- 77)
test lower_new_unicode  ... bench:         644 ns/iter (+/- 49)
test lower_new_unicode2 ... bench:       1,642 ns/iter (+/- 218)
test lower_new_worst    ... bench:       1,821 ns/iter (+/- 149)
test lower_std_best     ... bench:       1,840 ns/iter (+/- 263)
test lower_std_mixed    ... bench:       1,808 ns/iter (+/- 256)
test lower_std_unicode  ... bench:       1,941 ns/iter (+/- 106)
test lower_std_unicode2 ... bench:       1,918 ns/iter (+/- 94)
test lower_std_worst    ... bench:       1,798 ns/iter (+/- 143)
test upper_new_best     ... bench:         231 ns/iter (+/- 47)
test upper_new_mixed    ... bench:       1,206 ns/iter (+/- 65)
test upper_new_unicode  ... bench:         724 ns/iter (+/- 164)
test upper_new_unicode2 ... bench:       1,566 ns/iter (+/- 128)
test upper_new_worst    ... bench:       1,838 ns/iter (+/- 120)
test upper_std_best     ... bench:       1,669 ns/iter (+/- 155)
test upper_std_mixed    ... bench:       1,720 ns/iter (+/- 166)
test upper_std_unicode  ... bench:       1,820 ns/iter (+/- 133)
test upper_std_unicode2 ... bench:       1,797 ns/iter (+/- 146)
test upper_std_worst    ... bench:       1,669 ns/iter (+/- 110)

#![feature(test, unicode)]
extern crate test;
use test::Bencher;

extern crate rustc_unicode;

fn main() {
    println!("Hello, world!");
}

fn to_uppercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of upper case characters to push into the
    // result or extend with the upper case version if a lower case
    // character is found.
    for (i, ch) in this.char_indices() {
        if ch.is_lowercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            s.extend(ch.to_uppercase());
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover upper case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    s
}

fn to_lowercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of lower case characters to push into the
    // result or extend with the lower case version if an upper case
    // character is found.
    for (i, ch) in this.char_indices() {
        if ch.is_uppercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            if ch == 'Σ' {
                // Σ maps to σ, except at the end of a word where it maps to ς.
                // This is the only conditional (contextual) but language-independent mapping
                // in `SpecialCasing.txt`,
                // so hard-code it rather than have a generic "condition" mechanism.
                // See https://github.com/rust-lang/rust/issues/26035

                map_uppercase_sigma(this, i, &mut s);
            }
            else {
                s.extend(ch.to_lowercase());
            }
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover lower case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    return s;

    fn map_uppercase_sigma(from: &str, i: usize, to: &mut String) {
        // See http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G33992
        // for the definition of `Final_Sigma`.
        debug_assert!('Σ'.len_utf8() == 2);
        let is_word_final = case_ignoreable_then_cased(from[..i].chars().rev()) &&
                           !case_ignoreable_then_cased(from[i + 2..].chars());

        to.push(if is_word_final {
                'ς'
        } else {
                'σ'
        });
    }

    fn case_ignoreable_then_cased<I: Iterator<Item = char>>(iter: I) -> bool {
        use rustc_unicode::derived_property::{Cased, Case_Ignorable};
        match iter.skip_while(|&c| Case_Ignorable(c)).next() {
            Some(c) => Cased(c),
            None => false,
        }
    }
}

#[bench]
fn upper_new_worst(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_new_mixed(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa"));
}

#[bench]
fn upper_new_best(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_std_worst(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_mixed(b: &mut Bencher) {
    b.iter(|| "AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa".to_uppercase());
}

#[bench]
fn upper_std_unicode(b: &mut Bencher) {
    b.iter(|| "AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn upper_std_unicode2(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_best(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn lower_new_worst(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_new_mixed(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA"));
}

#[bench]
fn lower_new_best(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAÐªAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_std_worst(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_mixed(b: &mut Bencher) {
    b.iter(|| "aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA".to_lowercase());
}

#[bench]
fn lower_std_unicode(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_lowercase());
}

#[bench]
fn lower_std_unicode2(b: &mut Bencher) {
    b.iter(|| "AAAÐªAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_best(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_lowercase());
}

rust-highfive · 2016-08-30T01:32:31Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

alexcrichton · 2016-08-30T17:06:20Z

Thanks for the PR! Can you also be sure to add a few test cases for some of the more tricky corner cases this needs to handle as well?

meh · 2016-08-30T19:37:48Z

Where should I put them? I'm still not sure where the tests for stuff are supposed to go.

Do you have any specific examples you're concerned about? The only actual special case that was in the previous code is the final sigma one, which is basically copied as is.

bluss · 2016-08-31T11:16:41Z

There are tests in src/libcollectionstest/str.rs already; check whether they cover the special cases (I think they do).

meh · 2016-09-01T21:48:48Z

The failing test is confusing, but my Unicode knowledge doesn't go that far.

If anyone could enlighten me: the characters Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ seem to behave oddly with is_uppercase and is_lowercase. I could add an exception for them, but I'm not sure there aren't others with the same behavior.

frewsxcv · 2016-09-01T22:08:00Z

From what I understand, Unicode classifies ~~characters~~ letters as uppercase, lowercase, or titlecase:

Ǆ (uppercase): is_uppercase=true, is_lowercase=false
ǅ (titlecase): is_uppercase=false, is_lowercase=false
ǆ (lowercase): is_uppercase=false, is_lowercase=true
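The three rows above can be checked with std's `char` methods. This is a small sketch; `case_flags` is a hypothetical helper name, and the code points are written as escapes (U+01C4, U+01C5, U+01C6) to avoid any ambiguity about which character is meant.

```rust
/// Returns `(is_uppercase, is_lowercase)` for a character, using only
/// the `char` methods from std.
fn case_flags(c: char) -> (bool, bool) {
    (c.is_uppercase(), c.is_lowercase())
}

/// Lowercase form of a single character as a `String`.
fn lower_of(c: char) -> String {
    c.to_lowercase().collect()
}
```

A titlecase letter such as U+01C5 reports `false` for both predicates, yet `lower_of` still maps it to U+01C6.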

frewsxcv · 2016-09-01T22:13:58Z

So in this test, which is currently failing:

assert_eq!("AÉǅaé ".to_lowercase(), "aéǆaé ");

It's expecting ǅ (titlecase) to be translated into its lowercase variant.
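To make the failure concrete, here is a simplified sketch of the first revision's check, which converts only characters that report `is_uppercase()` and copies everything else. The function name is hypothetical, and the chunking and sigma handling of the real patch are left out on purpose.

```rust
/// Simplified form of the first revision's check: only characters that
/// report `is_uppercase()` are converted, everything else is copied.
fn lowercase_only_uppercase(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        if ch.is_uppercase() {
            out.extend(ch.to_lowercase());
        } else {
            // A titlecase letter lands here and is copied unchanged.
            out.push(ch);
        }
    }
    out
}
```

With the test input, this leaves U+01C5 in place, whereas `str::to_lowercase` turns it into U+01C6.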

meh · 2016-09-01T22:59:38Z

@frewsxcv makes sense, thanks a bunch.

Updated benchmark results:

test lower_new_best       ... bench:         203 ns/iter (+/- 18)
test lower_new_mixed      ... bench:       1,231 ns/iter (+/- 274)
test lower_new_really_bad ... bench:       1,112 ns/iter (+/- 145)
test lower_new_unicode    ... bench:         665 ns/iter (+/- 107)
test lower_new_unicode2   ... bench:       1,661 ns/iter (+/- 168)
test lower_new_worst      ... bench:       1,937 ns/iter (+/- 632)
test lower_std_best       ... bench:       1,866 ns/iter (+/- 189)
test lower_std_mixed      ... bench:       1,819 ns/iter (+/- 232)
test lower_std_really_bad ... bench:       1,859 ns/iter (+/- 160)
test lower_std_unicode    ... bench:       2,005 ns/iter (+/- 230)
test lower_std_unicode2   ... bench:       2,005 ns/iter (+/- 310)
test lower_std_worst      ... bench:       1,829 ns/iter (+/- 172)
test upper_new_best       ... bench:         210 ns/iter (+/- 116)
test upper_new_mixed      ... bench:       1,214 ns/iter (+/- 221)
test upper_new_really_bad ... bench:       1,101 ns/iter (+/- 112)
test upper_new_unicode    ... bench:         753 ns/iter (+/- 444)
test upper_new_unicode2   ... bench:       1,619 ns/iter (+/- 240)
test upper_new_worst      ... bench:       1,870 ns/iter (+/- 160)
test upper_std_best       ... bench:       1,715 ns/iter (+/- 261)
test upper_std_mixed      ... bench:       1,744 ns/iter (+/- 134)
test upper_std_really_bad ... bench:       1,714 ns/iter (+/- 208)
test upper_std_unicode    ... bench:       1,850 ns/iter (+/- 250)
test upper_std_unicode2   ... bench:       1,817 ns/iter (+/- 224)
test upper_std_worst      ... bench:       1,736 ns/iter (+/- 265)

Code for them:

#![feature(test, unicode)]
extern crate test;
use test::Bencher;

extern crate rustc_unicode;

fn main() {
    println!("Hello, world!");
}

fn to_uppercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of upper case characters to push into the
    // result or extend with the upper case version if a lower case
    // character is found.
    for (i, ch) in this.char_indices() {
        if !ch.is_uppercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            s.extend(ch.to_uppercase());
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover upper case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    s
}

fn to_lowercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of lower case characters to push into the
    // result or extend with the lower case version if an upper case
    // character is found.
    for (i, ch) in this.char_indices() {
        if !ch.is_lowercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            if ch == 'Σ' {
                // Σ maps to σ, except at the end of a word where it maps to ς.
                // This is the only conditional (contextual) but language-independent mapping
                // in `SpecialCasing.txt`,
                // so hard-code it rather than have a generic "condition" mechanism.
                // See https://github.com/rust-lang/rust/issues/26035

                map_uppercase_sigma(this, i, &mut s);
            }
            else {
                s.extend(ch.to_lowercase());
            }
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover lower case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    return s;

    fn map_uppercase_sigma(from: &str, i: usize, to: &mut String) {
        // See http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G33992
        // for the definition of `Final_Sigma`.
        debug_assert!('Σ'.len_utf8() == 2);
        let is_word_final = case_ignoreable_then_cased(from[..i].chars().rev()) &&
                           !case_ignoreable_then_cased(from[i + 2..].chars());

        to.push(if is_word_final {
                'ς'
        } else {
                'σ'
        });
    }

    fn case_ignoreable_then_cased<I: Iterator<Item = char>>(iter: I) -> bool {
        use rustc_unicode::derived_property::{Cased, Case_Ignorable};
        match iter.skip_while(|&c| Case_Ignorable(c)).next() {
            Some(c) => Cased(c),
            None => false,
        }
    }
}

#[test]
fn std_to_lowercase() {
    assert_eq!("".to_lowercase(), "");
    assert_eq!("AÉǅaé ".to_lowercase(), "aéǆaé ");

    // https://github.com/rust-lang/rust/issues/26035
    assert_eq!("ΑΣ".to_lowercase(), "ας");
    assert_eq!("Α'Σ".to_lowercase(), "α'ς");
    assert_eq!("Α''Σ".to_lowercase(), "α''ς");

    assert_eq!("ΑΣ Α".to_lowercase(), "ας α");
    assert_eq!("Α'Σ Α".to_lowercase(), "α'ς α");
    assert_eq!("Α''Σ Α".to_lowercase(), "α''ς α");

    assert_eq!("ΑΣ' Α".to_lowercase(), "ας' α");
    assert_eq!("ΑΣ'' Α".to_lowercase(), "ας'' α");

    assert_eq!("Α'Σ' Α".to_lowercase(), "α'ς' α");
    assert_eq!("Α''Σ'' Α".to_lowercase(), "α''ς'' α");

    assert_eq!("Α Σ".to_lowercase(), "α σ");
    assert_eq!("Α 'Σ".to_lowercase(), "α 'σ");
    assert_eq!("Α ''Σ".to_lowercase(), "α ''σ");

    assert_eq!("Σ".to_lowercase(), "σ");
    assert_eq!("'Σ".to_lowercase(), "'σ");
    assert_eq!("''Σ".to_lowercase(), "''σ");

    assert_eq!("ΑΣΑ".to_lowercase(), "ασα");
    assert_eq!("ΑΣ'Α".to_lowercase(), "ασ'α");
    assert_eq!("ΑΣ''Α".to_lowercase(), "ασ''α");
}

#[test]
fn std_to_uppercase() {
    assert_eq!("".to_uppercase(), "");
    assert_eq!("aéǅßﬁᾀ".to_uppercase(), "AÉǄSSFIἈΙ");
}

#[test]
fn new_to_lowercase() {
    assert_eq!(to_lowercase(""), "");
    assert_eq!(to_lowercase("AÉǅaé "), "aéǆaé ");

    // https://github.com/rust-lang/rust/issues/26035
    assert_eq!(to_lowercase("ΑΣ"), "ας");
    assert_eq!(to_lowercase("Α'Σ"), "α'ς");
    assert_eq!(to_lowercase("Α''Σ"), "α''ς");

    assert_eq!(to_lowercase("ΑΣ Α"), "ας α");
    assert_eq!(to_lowercase("Α'Σ Α"), "α'ς α");
    assert_eq!(to_lowercase("Α''Σ Α"), "α''ς α");

    assert_eq!(to_lowercase("ΑΣ' Α"), "ας' α");
    assert_eq!(to_lowercase("ΑΣ'' Α"), "ας'' α");

    assert_eq!(to_lowercase("Α'Σ' Α"), "α'ς' α");
    assert_eq!(to_lowercase("Α''Σ'' Α"), "α''ς'' α");

    assert_eq!(to_lowercase("Α Σ"), "α σ");
    assert_eq!(to_lowercase("Α 'Σ"), "α 'σ");
    assert_eq!(to_lowercase("Α ''Σ"), "α ''σ");

    assert_eq!(to_lowercase("Σ"), "σ");
    assert_eq!(to_lowercase("'Σ"), "'σ");
    assert_eq!(to_lowercase("''Σ"), "''σ");

    assert_eq!(to_lowercase("ΑΣΑ"), "ασα");
    assert_eq!(to_lowercase("ΑΣ'Α"), "ασ'α");
    assert_eq!(to_lowercase("ΑΣ''Α"), "ασ''α");
}

#[test]
fn new_to_uppercase() {
    assert_eq!(to_uppercase(""), "");
    assert_eq!(to_uppercase("aéǅßﬁᾀ"), "AÉǄSSFIἈΙ");
}

#[bench]
fn upper_new_really_bad(b: &mut Bencher) {
    b.iter(|| to_uppercase("aAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaA"));
}

#[bench]
fn upper_new_worst(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_new_mixed(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa"));
}

#[bench]
fn upper_new_best(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_std_really_bad(b: &mut Bencher) {
    b.iter(|| "aAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaA".to_uppercase());
}

#[bench]
fn upper_std_worst(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_mixed(b: &mut Bencher) {
    b.iter(|| "AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa".to_uppercase());
}

#[bench]
fn upper_std_unicode(b: &mut Bencher) {
    b.iter(|| "AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn upper_std_unicode2(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_best(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn lower_new_really_bad(b: &mut Bencher) {
    b.iter(|| to_lowercase("AaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAa"));
}

#[bench]
fn lower_new_worst(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_new_mixed(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA"));
}

#[bench]
fn lower_new_best(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAÐªAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_std_really_bad(b: &mut Bencher) {
    b.iter(|| "AaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAa".to_lowercase());
}

#[bench]
fn lower_std_worst(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_mixed(b: &mut Bencher) {
    b.iter(|| "aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA".to_lowercase());
}

#[bench]
fn lower_std_unicode(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_lowercase());
}

#[bench]
fn lower_std_unicode2(b: &mut Bencher) {
    b.iter(|| "AAAÐªAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_best(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_lowercase());
}

meh · 2016-09-02T02:02:39Z

Looks like an unrelated failure on travis.

frewsxcv · 2016-09-02T02:15:58Z

> Looks like an unrelated failure on travis.

#36138

alexcrichton · 2016-09-03T05:48:33Z

src/libcollections/str.rs

+        // result or extend with the lower case version if an upper case
+        // character is found.
+        for (i, ch) in self.char_indices() {
+            if !ch.is_lowercase() && ch.is_alphabetic() {


Could this perhaps be ch != ch.to_lowercase()?

I'm wary about saying that !ch.is_lowercase() is equivalent to the logic above...

meh · 2016-09-03

ch.to_lowercase would return an Iterator, so I'd have to create it, make it peekable, check the first character is different and then use it.

The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.

If someone with Unicode knowledge can chime in it would be nice, I'm wary about it too.

From what I understand non-alphabetic characters can't be turned to lower case, and the logic is that some characters (title case and non-alphabetic ones) are neither upper case nor lower case.

alexcrichton · 2016-09-06

cc @SimonSapin, tweaks in the behavior of to_lowercase here.

The optimization here is to collect up chunks of a string that are entirely lowercase and then push it all onto the string all at once (instead of one at a time). The logic is to instead test if each character is lowercase already and just maintain some indexes if that's the case.

We're worried though that this may not produce the same results?

SimonSapin · 2016-09-06

> From what I understand non-alphabetic characters can't be turned to lower case,

Unicode is full of surprising corner cases, so I wouldn’t rely on something like this on faith.

I’m opposed to skipping char::to_lowercase unless you make src/etc/unicode.py check exhaustively that it returns its input unchanged for every code point where you would skip it, and that this stays the case when we update to a new Unicode version.
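The comment asks for this check at table-generation time in src/etc/unicode.py. As a rough runtime analogue only, the same question can be asked of whatever Unicode tables the installed std ships with. The function names below are hypothetical, and no result is claimed here for any particular Unicode version.

```rust
/// The skip rule under discussion: leave `c` alone when it is already
/// lowercase or is not alphabetic.
fn would_skip(c: char) -> bool {
    c.is_lowercase() || !c.is_alphabetic()
}

/// Collects every code point that the rule would skip even though
/// `char::to_lowercase` would change it.
fn skip_rule_counterexamples() -> Vec<char> {
    ('\0'..=char::MAX)
        .filter(|&c| would_skip(c) && !c.to_lowercase().eq(std::iter::once(c)))
        .collect()
}
```

An empty result would mean the rule is safe for the Unicode version std was built against. It says nothing about future versions, which is why the comment asks for the check to live in the table generator.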

> The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.

Here are a few random ideas to improve this.

char::to_lowercase is implemented in src/librustc_unicode/tables.rs with a binary search, while char::is_lowercase and char::is_alphabetic use a trie of bits and have a fast-path for ASCII.

char::to_lowercase could use a similar BoolTrie to skip the binary search for code points that are unchanged (though this would have a cost in binary size) and have its own ASCII fast path.
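The ASCII fast path can be sketched from the caller's side, without touching the tables. `push_lowercase` is a hypothetical helper, and whether a given std version already does something similar inside char::to_lowercase is not assumed here.

```rust
/// Appends the lowercase form of `c` to `out`, trying ASCII first.
fn push_lowercase(c: char, out: &mut String) {
    if c.is_ascii() {
        // ASCII never expands to several characters and needs no table lookup.
        out.push(c.to_ascii_lowercase());
    } else {
        // Everything else goes through the general Unicode mapping.
        out.extend(c.to_lowercase());
    }
}
```

For ASCII input the result matches `char::to_lowercase`, since ASCII letters have only simple one-to-one case mappings.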

to_lowercase_table could be changed to store [u8; 9] (or whatever is the maximum length) in UTF-8 rather than [char; 3] in (effectively) UTF-32, so that pushing to a String doesn’t need to do an encoding conversion.
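A minimal sketch of the UTF-8 table idea follows. The entry layout and the name `Utf8Mapping` are hypothetical, the 12-byte buffer is just a safe upper bound for three characters rather than the real maximum, and the entry is built at runtime here whereas a real table would be generated ahead of time.

```rust
/// Hypothetical table entry: the lowercase mapping of one code point,
/// stored as UTF-8 bytes plus a length.
struct Utf8Mapping {
    len: u8,
    bytes: [u8; 12], // 3 chars * 4 bytes: an assumed upper bound for this sketch
}

impl Utf8Mapping {
    /// Builds the entry from std's mapping; stands in for a generated table.
    fn for_lowercase(c: char) -> Self {
        let mut bytes = [0u8; 12];
        let mut len = 0;
        for m in c.to_lowercase() {
            len += m.encode_utf8(&mut bytes[len..]).len();
        }
        Utf8Mapping { len: len as u8, bytes }
    }

    /// The stored mapping as a string slice, ready for `String::push_str`.
    fn as_str(&self) -> &str {
        // The bytes were produced by `encode_utf8`, so they are valid UTF-8.
        std::str::from_utf8(&self.bytes[..self.len as usize]).expect("valid UTF-8")
    }
}
```

With such an entry, appending a mapping becomes `out.push_str(entry.as_str())`, a byte copy with no per-character encoding step.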

If doing larger copies from &self turns out to be significantly more efficient than pushing code points one (or up to three) at a time, the ToLowercase iterator could be extended to implement PartialEq<char> or equivalent, to find out if the conversion was a no-op.

It might also be possible to replace the binary searches entirely, with tries. I’ll look into that.

SimonSapin · 2016-09-07

Ok, tries for case mapping are not as easy as I first thought: we’d need to play some more tricks to keep it somewhat space-efficient, such as encoding ranges of code points that map to a single code point at a constant offset.

I think adding PartialEq<char> to ToLowercase and ToUppercase would already allow the optimization to work.
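Here is a sketch of how the chunking optimization could use such a comparison. A local wrapper type, `Lowered`, stands in for the proposed impl, because a crate outside std cannot add `PartialEq<char>` to `ToLowercase` itself. Both names are hypothetical. Final-sigma handling is not reimplemented, so inputs containing Σ are handed to std.

```rust
use std::char::ToLowercase;

/// Hypothetical local wrapper standing in for the proposed
/// `PartialEq<char>` impl on `ToLowercase`.
struct Lowered(ToLowercase);

impl PartialEq<char> for Lowered {
    fn eq(&self, other: &char) -> bool {
        // Equal only if the mapping is exactly one char and it is `other`.
        let mut it = self.0.clone();
        it.next() == Some(*other) && it.next().is_none()
    }
}

fn to_lowercase_chunked(s: &str) -> String {
    // The Final_Sigma rule needs Unicode properties std does not expose,
    // so this sketch defers to std for any input containing Σ.
    if s.contains('Σ') {
        return s.to_lowercase();
    }
    let mut out = String::with_capacity(s.len());
    let mut run_start = None;
    for (i, ch) in s.char_indices() {
        let lowered = Lowered(ch.to_lowercase());
        if lowered == ch {
            // Unchanged: extend the current run instead of pushing.
            if run_start.is_none() {
                run_start = Some(i);
            }
        } else {
            // Changed: flush the pending run with one push_str, then map.
            if let Some(start) = run_start.take() {
                out.push_str(&s[start..i]);
            }
            out.extend(lowered.0);
        }
    }
    // Flush the final run of unchanged characters.
    if let Some(start) = run_start {
        out.push_str(&s[start..]);
    }
    out
}
```

Unlike a check based on is_lowercase or is_alphabetic, the skip decision here comes from the mapping itself, so it cannot disagree with char::to_lowercase. The cost is that the mapping is still looked up for every character.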

meh · 2016-09-07

From what I tested, the performance improvement comes from using push_str instead of extend: push_str gets optimized to a memcpy, so the longer the already properly cased substring is, the faster it gets.

I can look into adding the PartialEq<char> impl and using it here if you want.
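For reference, the two copy strategies being compared look like this in isolation. This is only a sketch with hypothetical function names. It shows that both produce the same string; it does not measure anything, and the memcpy claim is the commenter's observation rather than something verified here.

```rust
/// Copies `s` one `char` at a time, as `String::extend` over `chars()` does.
fn copy_by_chars(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    out.extend(s.chars());
    out
}

/// Copies `s` as a single byte slice, which is what a run of
/// already properly cased characters allows.
fn copy_by_slice(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    out.push_str(s);
    out
}
```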

alexcrichton · 2016-10-31T20:33:17Z

Closing due to inactivity, but feel free to resubmit with comments addressed!

meh · 2016-10-31T20:35:54Z

@alexcrichton I was actually waiting for @SimonSapin to reply 🐼

alexcrichton · 2016-10-31T20:53:05Z

ah in that case, ping @SimonSapin, I believe about this comment

SimonSapin · 2016-10-31T22:27:19Z

@meh I’m not sure what response you’re expecting as I don’t see a question. I’ve said that I’m not confident that the patch as-is (using is_alphabetic) is correct, and suggested some alternatives.

rust-highfive assigned alexcrichton Aug 30, 2016

meh added 2 commits September 2, 2016 00:55

Improve performance of str::to_lowercase and str::to_uppercase

f587884

Fix handling of title-case characters

33322c8

meh force-pushed the faster-upperlowercase branch from dfad156 to 33322c8 Compare September 1, 2016 22:57

Avoid changing case on symbols

7452bbd

alexcrichton reviewed Sep 3, 2016

alexcrichton closed this Oct 31, 2016
