Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of str::to_lowercase and str::to_uppercase #36127

Closed
wants to merge 3 commits into from

Conversation

meh
Copy link
Contributor

@meh meh commented Aug 30, 2016

I was working on a crate to handle string casing and it seems the to_lowercase and to_uppercase implementations I came up with (excluding the fact the stuff in my crate returns Cow<Self>) is faster than the one in libstd in most cases, and a little slower in the worst case (i.e. all uppercase when calling to_lowercase or the other way around).

I ran the libstd tests and it was all green locally, but it doesn't look like there any test cases for these two functions, unless I'm blind 🐼

Below are the benchmarks (which probably aren't testing any real world scenario cases) I ran locally, and the code to run them.

test lower_new_best     ... bench:         208 ns/iter (+/- 14)
test lower_new_mixed    ... bench:       1,216 ns/iter (+/- 77)
test lower_new_unicode  ... bench:         644 ns/iter (+/- 49)
test lower_new_unicode2 ... bench:       1,642 ns/iter (+/- 218)
test lower_new_worst    ... bench:       1,821 ns/iter (+/- 149)
test lower_std_best     ... bench:       1,840 ns/iter (+/- 263)
test lower_std_mixed    ... bench:       1,808 ns/iter (+/- 256)
test lower_std_unicode  ... bench:       1,941 ns/iter (+/- 106)
test lower_std_unicode2 ... bench:       1,918 ns/iter (+/- 94)
test lower_std_worst    ... bench:       1,798 ns/iter (+/- 143)
test upper_new_best     ... bench:         231 ns/iter (+/- 47)
test upper_new_mixed    ... bench:       1,206 ns/iter (+/- 65)
test upper_new_unicode  ... bench:         724 ns/iter (+/- 164)
test upper_new_unicode2 ... bench:       1,566 ns/iter (+/- 128)
test upper_new_worst    ... bench:       1,838 ns/iter (+/- 120)
test upper_std_best     ... bench:       1,669 ns/iter (+/- 155)
test upper_std_mixed    ... bench:       1,720 ns/iter (+/- 166)
test upper_std_unicode  ... bench:       1,820 ns/iter (+/- 133)
test upper_std_unicode2 ... bench:       1,797 ns/iter (+/- 146)
test upper_std_worst    ... bench:       1,669 ns/iter (+/- 110)
#![feature(test, unicode)]
extern crate test;
use test::Bencher;

extern crate rustc_unicode;

fn main() {
    println!("Hello, world!");
}

fn to_uppercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of upper case characters to push into the
    // result or extend with the upper case version if a lower case
    // character is found.
    for (i, ch) in this.char_indices() {
        if ch.is_lowercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            s.extend(ch.to_uppercase());
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover upper case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    s
}

fn to_lowercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of lower case characters to push into the
    // result or extend with the lower case version if an upper case
    // character is found.
    for (i, ch) in this.char_indices() {
        if ch.is_uppercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            if ch == 'Σ' {
                // Σ maps to σ, except at the end of a word where it maps to ς.
                // This is the only conditional (contextual) but language-independent mapping
                // in `SpecialCasing.txt`,
                // so hard-code it rather than have a generic "condition" mechanism.
                // See https://github.com/rust-lang/rust/issues/26035

                map_uppercase_sigma(this, i, &mut s);
            }
            else {
                s.extend(ch.to_lowercase());
            }
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover upper case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    return s;

    fn map_uppercase_sigma(from: &str, i: usize, to: &mut String) {
        // See http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G33992
        // for the definition of `Final_Sigma`.
        debug_assert!('Σ'.len_utf8() == 2);
        let is_word_final = case_ignoreable_then_cased(from[..i].chars().rev()) &&
                           !case_ignoreable_then_cased(from[i + 2..].chars());

        to.push(if is_word_final {
                'ς'
        } else {
                'σ'
        });
    }

    fn case_ignoreable_then_cased<I: Iterator<Item = char>>(iter: I) -> bool {
        use rustc_unicode::derived_property::{Cased, Case_Ignorable};
        match iter.skip_while(|&c| Case_Ignorable(c)).next() {
            Some(c) => Cased(c),
            None => false,
        }
    }
}

#[bench]
fn upper_new_worst(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_new_mixed(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa"));
}

#[bench]
fn upper_new_best(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_std_worst(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_mixed(b: &mut Bencher) {
    b.iter(|| "AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa".to_uppercase());
}

#[bench]
fn upper_std_unicode(b: &mut Bencher) {
    b.iter(|| "AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn upper_std_unicode2(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_best(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn lower_new_worst(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_new_mixed(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA"));
}

#[bench]
fn lower_new_best(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAЪAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_std_worst(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_mixed(b: &mut Bencher) {
    b.iter(|| "aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA".to_lowercase());
}

#[bench]
fn lower_std_unicode(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_lowercase());
}

#[bench]
fn lower_std_unicode2(b: &mut Bencher) {
    b.iter(|| "AAAЪAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_best(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_lowercase());
}

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@alexcrichton
Copy link
Member

Thanks for the PR! Can you also be sure to add a few test cases for some of the more tricky corner cases this needs to handle as well?

@meh
Copy link
Contributor Author

meh commented Aug 30, 2016

Where should I put them? I'm still not sure where the tests for stuff are supposed to go.

Do you have any specific examples you're concerned about? The only actual special case that was in the previous code is the final sigma one, which is basically copied as is.

@bluss
Copy link
Member

bluss commented Aug 31, 2016

There exists tests in src/libcollectionstest/str.rs already, check if they cover the special cases already (I think they do).

@meh
Copy link
Contributor Author

meh commented Sep 1, 2016

The failing test is confusing, but my Unicode knowledge doesn't go that far.

If anyone could enlighten me, these characters DŽ Dž dž LJ Lj lj NJ Nj nj seem to have weird behavior with is_uppercase and is_lowercase, I could add an exception for it but I'm not sure there aren't any others with this behavior.

@frewsxcv
Copy link
Member

frewsxcv commented Sep 1, 2016

From what I understand, unicode classifies characters letters as: uppercase, lowercase, and titlecase

@frewsxcv
Copy link
Member

frewsxcv commented Sep 1, 2016

So in this test, which is currently failing:

assert_eq!("AÉDžaé ".to_lowercase(), "aédžaé ");

It's expecting Dž (titlecase) to be translated into its lowercase variant

@meh meh force-pushed the faster-upperlowercase branch from dfad156 to 33322c8 Compare September 1, 2016 22:57
@meh
Copy link
Contributor Author

meh commented Sep 1, 2016

@frewsxcv makes sense, thanks a bunch.

Updated benchmark results:

test lower_new_best       ... bench:         203 ns/iter (+/- 18)
test lower_new_mixed      ... bench:       1,231 ns/iter (+/- 274)
test lower_new_really_bad ... bench:       1,112 ns/iter (+/- 145)
test lower_new_unicode    ... bench:         665 ns/iter (+/- 107)
test lower_new_unicode2   ... bench:       1,661 ns/iter (+/- 168)
test lower_new_worst      ... bench:       1,937 ns/iter (+/- 632)
test lower_std_best       ... bench:       1,866 ns/iter (+/- 189)
test lower_std_mixed      ... bench:       1,819 ns/iter (+/- 232)
test lower_std_really_bad ... bench:       1,859 ns/iter (+/- 160)
test lower_std_unicode    ... bench:       2,005 ns/iter (+/- 230)
test lower_std_unicode2   ... bench:       2,005 ns/iter (+/- 310)
test lower_std_worst      ... bench:       1,829 ns/iter (+/- 172)
test upper_new_best       ... bench:         210 ns/iter (+/- 116)
test upper_new_mixed      ... bench:       1,214 ns/iter (+/- 221)
test upper_new_really_bad ... bench:       1,101 ns/iter (+/- 112)
test upper_new_unicode    ... bench:         753 ns/iter (+/- 444)
test upper_new_unicode2   ... bench:       1,619 ns/iter (+/- 240)
test upper_new_worst      ... bench:       1,870 ns/iter (+/- 160)
test upper_std_best       ... bench:       1,715 ns/iter (+/- 261)
test upper_std_mixed      ... bench:       1,744 ns/iter (+/- 134)
test upper_std_really_bad ... bench:       1,714 ns/iter (+/- 208)
test upper_std_unicode    ... bench:       1,850 ns/iter (+/- 250)
test upper_std_unicode2   ... bench:       1,817 ns/iter (+/- 224)
test upper_std_worst      ... bench:       1,736 ns/iter (+/- 265)

Code for them:

#![feature(test, unicode)]
extern crate test;
use test::Bencher;

extern crate rustc_unicode;

fn main() {
    println!("Hello, world!");
}

fn to_uppercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of upper case characters to push into the
    // result or extend with the upper case version if a lower case
    // character is found.
    for (i, ch) in this.char_indices() {
        if !ch.is_uppercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            s.extend(ch.to_uppercase());
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover upper case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    s
}

fn to_lowercase(this: &str) -> String {
    let mut s = String::with_capacity(this.len());
    let mut left = None;

    // Try to collect slices of lower case characters to push into the
    // result or extend with the lower case version if an upper case
    // character is found.
    for (i, ch) in this.char_indices() {
        if !ch.is_lowercase() {
            if let Some(offset) = left.take() {
                s.push_str(&this[offset..i]);
            }

            if ch == 'Σ' {
                // Σ maps to σ, except at the end of a word where it maps to ς.
                // This is the only conditional (contextual) but language-independent mapping
                // in `SpecialCasing.txt`,
                // so hard-code it rather than have a generic "condition" mechanism.
                // See https://github.com/rust-lang/rust/issues/26035

                map_uppercase_sigma(this, i, &mut s);
            }
            else {
                s.extend(ch.to_lowercase());
            }
        }
        else if left.is_none() {
            left = Some(i);
        }
    }

    // Append any leftover upper case characters.
    if let Some(offset) = left.take() {
        s.push_str(&this[offset..]);
    }

    return s;

    fn map_uppercase_sigma(from: &str, i: usize, to: &mut String) {
        // See http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G33992
        // for the definition of `Final_Sigma`.
        debug_assert!('Σ'.len_utf8() == 2);
        let is_word_final = case_ignoreable_then_cased(from[..i].chars().rev()) &&
                           !case_ignoreable_then_cased(from[i + 2..].chars());

        to.push(if is_word_final {
                'ς'
        } else {
                'σ'
        });
    }

    fn case_ignoreable_then_cased<I: Iterator<Item = char>>(iter: I) -> bool {
        use rustc_unicode::derived_property::{Cased, Case_Ignorable};
        match iter.skip_while(|&c| Case_Ignorable(c)).next() {
            Some(c) => Cased(c),
            None => false,
        }
    }
}

#[test]
fn std_to_lowercase() {
    assert_eq!("".to_lowercase(), "");
    assert_eq!("AÉDžaé ".to_lowercase(), "aédžaé ");

    // https://github.com/rust-lang/rust/issues/26035
    assert_eq!("ΑΣ".to_lowercase(), "ας");
    assert_eq!("Α'Σ".to_lowercase(), "α'ς");
    assert_eq!("Α''Σ".to_lowercase(), "α''ς");

    assert_eq!("ΑΣ Α".to_lowercase(), "ας α");
    assert_eq!("Α'Σ Α".to_lowercase(), "α'ς α");
    assert_eq!("Α''Σ Α".to_lowercase(), "α''ς α");

    assert_eq!("ΑΣ' Α".to_lowercase(), "ας' α");
    assert_eq!("ΑΣ'' Α".to_lowercase(), "ας'' α");

    assert_eq!("Α'Σ' Α".to_lowercase(), "α'ς' α");
    assert_eq!("Α''Σ'' Α".to_lowercase(), "α''ς'' α");

    assert_eq!("Α Σ".to_lowercase(), "α σ");
    assert_eq!("Α 'Σ".to_lowercase(), "α 'σ");
    assert_eq!("Α ''Σ".to_lowercase(), "α ''σ");

    assert_eq!("Σ".to_lowercase(), "σ");
    assert_eq!("'Σ".to_lowercase(), "'σ");
    assert_eq!("''Σ".to_lowercase(), "''σ");

    assert_eq!("ΑΣΑ".to_lowercase(), "ασα");
    assert_eq!("ΑΣ'Α".to_lowercase(), "ασ'α");
    assert_eq!("ΑΣ''Α".to_lowercase(), "ασ''α");
}

#[test]
fn std_to_uppercase() {
    assert_eq!("".to_uppercase(), "");
    assert_eq!("aéDžßfiᾀ".to_uppercase(), "AÉDŽSSFIἈΙ");
}

#[test]
fn new_to_lowercase() {
    assert_eq!(to_lowercase(""), "");
    assert_eq!(to_lowercase("AÉDžaé "), "aédžaé ");

    // https://github.com/rust-lang/rust/issues/26035
    assert_eq!(to_lowercase("ΑΣ"), "ας");
    assert_eq!(to_lowercase("Α'Σ"), "α'ς");
    assert_eq!(to_lowercase("Α''Σ"), "α''ς");

    assert_eq!(to_lowercase("ΑΣ Α"), "ας α");
    assert_eq!(to_lowercase("Α'Σ Α"), "α'ς α");
    assert_eq!(to_lowercase("Α''Σ Α"), "α''ς α");

    assert_eq!(to_lowercase("ΑΣ' Α"), "ας' α");
    assert_eq!(to_lowercase("ΑΣ'' Α"), "ας'' α");

    assert_eq!(to_lowercase("Α'Σ' Α"), "α'ς' α");
    assert_eq!(to_lowercase("Α''Σ'' Α"), "α''ς'' α");

    assert_eq!(to_lowercase("Α Σ"), "α σ");
    assert_eq!(to_lowercase("Α 'Σ"), "α 'σ");
    assert_eq!(to_lowercase("Α ''Σ"), "α ''σ");

    assert_eq!(to_lowercase("Σ"), "σ");
    assert_eq!(to_lowercase("'Σ"), "'σ");
    assert_eq!(to_lowercase("''Σ"), "''σ");

    assert_eq!(to_lowercase("ΑΣΑ"), "ασα");
    assert_eq!(to_lowercase("ΑΣ'Α"), "ασ'α");
    assert_eq!(to_lowercase("ΑΣ''Α"), "ασ''α");
}

#[test]
fn new_to_uppercase() {
    assert_eq!(to_uppercase(""), "");
    assert_eq!(to_uppercase("aéDžßfiᾀ"), "AÉDŽSSFIἈΙ");
}

#[bench]
fn upper_new_really_bad(b: &mut Bencher) {
    b.iter(|| to_uppercase("aAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaA"));
}

#[bench]
fn upper_new_worst(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_new_mixed(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa"));
}

#[bench]
fn upper_new_best(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode(b: &mut Bencher) {
    b.iter(|| to_uppercase("AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA"));
}

#[bench]
fn upper_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_uppercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn upper_std_really_bad(b: &mut Bencher) {
    b.iter(|| "aAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaA".to_uppercase());
}

#[bench]
fn upper_std_worst(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_mixed(b: &mut Bencher) {
    b.iter(|| "AAAaaaaaaaaaaaaAAAAAAAAAaaaaAaaaAAAAaaaaAAaaaaaaaAAAAaaaaa".to_uppercase());
}

#[bench]
fn upper_std_unicode(b: &mut Bencher) {
    b.iter(|| "AAAßAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAæAÆæAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn upper_std_unicode2(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_uppercase());
}

#[bench]
fn upper_std_best(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_uppercase());
}

#[bench]
fn lower_new_really_bad(b: &mut Bencher) {
    b.iter(|| to_lowercase("AaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAa"));
}

#[bench]
fn lower_new_worst(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_new_mixed(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA"));
}

#[bench]
fn lower_new_best(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode(b: &mut Bencher) {
    b.iter(|| to_lowercase("aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa"));
}

#[bench]
fn lower_new_unicode2(b: &mut Bencher) {
    b.iter(|| to_lowercase("AAAЪAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA"));
}

#[bench]
fn lower_std_really_bad(b: &mut Bencher) {
    b.iter(|| "AaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAa".to_lowercase());
}

#[bench]
fn lower_std_worst(b: &mut Bencher) {
    b.iter(|| "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_mixed(b: &mut Bencher) {
    b.iter(|| "aaaAAAAAAAAAAAAaaaaaaaaaAAAAaAAAaaaaAAAAaaAAAAAAAaaaaAAAAA".to_lowercase());
}

#[bench]
fn lower_std_unicode(b: &mut Bencher) {
    b.iter(|| "aaaßaaaaÐaaaaaaÆÆÆÆÆaaßaaÐÐÐÐaaaaaaaaaÆaæÆaaaaaaaaaaaaaaaa".to_lowercase());
}

#[bench]
fn lower_std_unicode2(b: &mut Bencher) {
    b.iter(|| "AAAЪAAAAðAAAAAAæææææAAßAAððððAAAAAAAAAÆAÆÆAAAAAAAAAAAAAAAA".to_lowercase());
}

#[bench]
fn lower_std_best(b: &mut Bencher) {
    b.iter(|| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".to_lowercase());
}

@meh
Copy link
Contributor Author

meh commented Sep 2, 2016

Looks like an unrelated failure on travis.

@frewsxcv
Copy link
Member

frewsxcv commented Sep 2, 2016

Looks like an unrelated failure on travis.

#36138

// result or extend with the lower case version if an upper case
// character is found.
for (i, ch) in self.char_indices() {
if !ch.is_lowercase() && ch.is_alphabetic() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this perhaps be ch != ch.to_lowercase()?

I'm wary about saying that !ch.is_lowercase() is equivalent to the logic above...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ch.to_lowercase would return an Iterator, so I'd have to create it, make it peekable, check the first character is different and then use it.

The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.

If someone with Unicode knowledge can chime in it would be nice, I'm wary about it too.

From what I understand non-alphabetic characters can't be turned to lower case, and the logic is that some characters (title case and non-alphabetic ones) are neither upper case nor lower case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @SimonSapin, tweaks in the behavior of to_lowercase here.

The optimization here is to collect up chunks of a string that are entirely lowercase and then push it all onto the string all at once (instead of one at a time). The logic is to instead test if each character is lowercase already and just maintain some indexes if that's the case.

We're worried though that this may not produce the same results?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand non-alphabetic characters can't be turned to lower case,

Unicode is full of surprising corner cases, so I wouldn’t rely on something like this on faith.

I’m opposed to skipping char::to_lowercase unless you make src/etc/unicode.py check exhaustively that it returns its input unchanged for ever code point where you would skip it, and that this stays the case when we updated to a new Unicode version.

The issue is it would defeat the purpose of the optimization, because the conversion is what takes most of the time.

Here are a few random ideas to improved this.

char::to_lowercase is implemented in src/librustc_unicode/tables.rs with a binary search, while char::is_lowercase and char::is_alphabetic use a trie of bits and have a fast-path for ASCII.

char::to_lowercase could use a similar BoolTrie to skip the binary search for code points that are unchanged (thought this would have a cost in binary size) and have its own ASCII fast path.

to_lowercase_table could be changed to store [u8; 9] (or whatever is the maximum length) in UTF-8 rather than [char; 3] in (effectively) UTF-32, so that pushing to a String doesn’t need to do an encoding conversion.

If doing larger copies from &self turns out to be significantly more efficient than pushing code points one (or up to three) at a time, the ToLowercase iterator could be extended to implement PartialEq<char> or equivalent, to find out if the conversion was a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be possible to replace the binary searches entirely, with tries. I’ll look into that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, tries for case mapping is not as easy as I first thought: we’d need to play some more tricks to keep it somewhat space-efficient, such as encoding ranges of code points that map to a single code point at a constant offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding PartialEq<char> to ToLowercase and ToUppercase would already allow the optimization to work.

From what I tested the performance improvement comes from using push_str instead of extend, because the push_str gets optimized to a memcpy, so the bigger the already properly cased substring is the faster it gets.

I can look into adding the PartialEq<char> impl and using it here if you want.

@alexcrichton
Copy link
Member

Closing due to inactivity, but feel free to resubmit with comments addressed!

@meh
Copy link
Contributor Author

meh commented Oct 31, 2016

@alexcrichton I was actually waiting for @SimonSapin to reply 🐼

@alexcrichton
Copy link
Member

ah in that case, ping @SimonSapin, I believe about this comment

@SimonSapin
Copy link
Contributor

@meh I’m not sure what response you’re expecting as I don’t see a question. I’ve said that I’m not confident that the patch as-is (using is_alphabetic) is correct, and suggested some alternatives.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants