UnicodeStrSlice: add width() and graphemes() methods #15619

kwantam · 2014-07-11T22:19:38Z

width() computes the displayed width of a string, ignoring the width of control characters.
- arguably we might do something else for control characters, but the question is, what?
- users who want to do something else can iterate over chars()
graphemes() returns a Graphemes struct, which implements an iterator over the grapheme clusters of a &str.
- fully compliant with UAX#29
- passes all Unicode-supplied tests
added code to generate additionial categories in unicode.py
- Cn aka Not_Assigned
- categories necessary for grapheme cluster breaking
tidied up the exports from libunicode
- all exports are exposed through a module rather than directly at crate root.
- std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

bluss · 2014-07-11T22:32:53Z

src/libunicode/u_str.rs

+    #[inline]
+    fn words(&self) -> Words<'a> {
+        self.split(u_char::is_whitespace).filter(|s| !s.is_empty())
+    }


Just as an aside, we note that this is not an UAX #29 algorithm for word boundaries

Good point. I'll add that to my list of things to think about.

Offhand, I'm inclined to say that either this should be a UAX #29 algorithm for word boundaries, or it should be removed entirely. It's easy enough for users to write that same words() function themselves if they need it, and the existence of this function precludes adding the UAX #29 version (at least, without avoiding confusion down the road).

That's a good point.

On the other hand, the UAX#29 word breaking algorithm is a bit meaty, and often people don't need something that complicated. Maybe we should leave words() as is, and have a function with a different name that implements the UAX#29 algorithm? I don't want people to use words() and end up having a lot more complication than necessary.

Maybe this deserves its own issue?

I can certainly understand why you don't want to implement the UAX#29 algorithm right now, and so I think you should just drop words() entirely. It's easy for someone else to write the whitespace-splitting version if they decide that's what they want, but I don't think it's appropriate to call that "words" within the context of a unicode-focused library.

words() is a separate issue. It was just moved in this commit, but it should be addressed (renamed or removed).

I opened #15628, but this behaviour of &str.words has been around for a while, so can be addressed later.

kwantam · 2014-07-11T23:25:02Z

@alexcrichton r?

bluss · 2014-07-12T00:22:56Z

Using char_indices() is a great change 👍

huonw · 2014-07-12T11:01:36Z

src/libunicode/u_str.rs

-    /// let v: Vec<&str> = some_words.words().collect();
-    /// assert_eq!(v, vec!["Mary", "had", "a", "little", "lamb"]);
-    /// ```
-    fn words(&self) -> Words<'a>;


Is there any particular reason that these functions are moving around? (Would it be possible to leave them where they were, to minimise the noise in the diff?)

It would be. I just wanted to sort the functions in a more sensible order, because it seemed like they were pretty haphazardly thrown together. (...which is my fault; I did put them there, after all!)

I'll swap them back to their original order.

huonw · 2014-07-12T12:37:31Z

It seems that unicode.py has r = "unicode.rs", while it should be r = "tables.rs".

Kimundi · 2014-07-12T12:57:14Z

src/libunicode/u_str.rs

 }

 impl<'a> UnicodeStrSlice<'a> for &'a str {
    #[inline]
-    fn words(&self) -> Words<'a> {
-        self.split(u_char::is_whitespace).filter(|s| !s.is_empty())
+    fn graphemes(&self, is_extended: bool) -> Graphemes<'a> {


I think it would make sense to provide grapheme_indices() as well, to complement char_indices()

Ah yes, I like that idea.

kwantam · 2014-07-13T16:29:14Z

Now that I think about it, probably it would also be nice to have a DoubleEndedIterator implementation for Graphemes and GraphemeIndices. I'll implement that tonight.

kwantam · 2014-07-14T01:01:30Z

OK, we have DoubleEndedIterator for Graphemes and GraphemeIndices.

Is there a nicer way of reversing a static slice than slice.iter().rev().collect()? I'm not so worried about the efficiency of the test code, but that doesn't mean I want to be doing something boneheaded.

huonw · 2014-07-14T01:30:54Z

You can just test equality of two iterators using std::iter::order::equals and so avoid the collect entirely.

kwantam · 2014-07-14T04:27:05Z

👍 thanks

huonw · 2014-07-14T23:11:07Z

src/libunicode/u_str.rs

+    /// # Example
+    ///
+    /// ```rust
+    /// let gr1 = "a̐éö̲".graphemes(true).collect::<Vec<&str>>();


I think using explicit code-point notation (as I used previously) may be better here, because the interesting thing with this iterator is how it groups multiple codepoints into a single entity.

When written in short form, it's not clear that these aren't individual codepoints.

huonw · 2014-07-14T23:15:21Z

src/libunicode/u_str.rs

+                _ => self.catb.take_unwrap()
+            };
+
+            state = match state {


To be clear, this table is effectively running http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules in reverse, right?

Yes. It implements the same splits, but running from the end of the string rather than the front.

- Graphemes and GraphemeIndices structs implement iterators over grapheme clusters analogous to the Chars and CharOffsets for chars in a string. Iterator and DoubleEndedIterator are available for both. - tidied up the exports for libunicode. crate root exports are now moved into more appropriate module locations: - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module - UnicodeChar exported from char instead of crate root - canonical_combining_class is exported from str rather than crate root Since libunicode's exports have changed, programs that previously relied on the old export locations will need to change their `use` statements to reflect the new ones. See above for more information on where the new exports live. closes #7043 [breaking-change]

kwantam · 2014-07-14T23:56:08Z

@alexcrichton @huonw should be good to go now.

Also realized that I needed [breaking-change] in the commit message because I tidied the export locations for libunicode, and programs that link directly to it would break. (Since it's pretty young, I doubt there are any such programs yet...)

- `width()` computes the displayed width of a string, ignoring the width of control characters. - arguably we might do *something* else for control characters, but the question is, what? - users who want to do something else can iterate over chars() - `graphemes()` returns a `Graphemes` struct, which implements an iterator over the grapheme clusters of a &str. - fully compliant with [UAX#29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) - passes all [Unicode-supplied tests](http://www.unicode.org/reports/tr41/tr41-15.html#Tests29) - added code to generate additionial categories in `unicode.py` - `Cn` aka `Not_Assigned` - categories necessary for grapheme cluster breaking - tidied up the exports from libunicode - all exports are exposed through a module rather than directly at crate root. - std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode closes #7043

ariasuni · 2017-06-14T22:54:44Z

How is that still Unstable? Should I open a new issue to stabilize this (at least graphemes()?)

bluss · 2017-08-03T18:24:56Z

@ariasuni It's not included even as experimental feature yet -- so the first step you can do is to get it into Rust as an unstable feature (write a PR, discuss on the forum or even write an RFC, which is probably not needed).

mamciek · 2017-12-28T12:45:08Z

FYI: It was removed from master
"Remove all unstable deprecated functionality"
8d90d3f

kwantam mentioned this pull request Jul 11, 2014

Unicode grapheme support #7043

Closed

bluss reviewed Jul 11, 2014
View reviewed changes

huonw reviewed Jul 12, 2014
View reviewed changes

Kimundi reviewed Jul 12, 2014
View reviewed changes

huonw reviewed Jul 14, 2014
View reviewed changes

kwantam added 2 commits July 14, 2014 19:53

add UnicodeStrSlice width() function

c066a1e

bors closed this Jul 16, 2014

bors merged commit cf432b8 into rust-lang:master Jul 16, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UnicodeStrSlice: add width() and graphemes() methods #15619

UnicodeStrSlice: add width() and graphemes() methods #15619

kwantam commented Jul 11, 2014

bluss Jul 11, 2014

kwantam Jul 11, 2014

lilyball Jul 11, 2014

kwantam Jul 12, 2014

lilyball Jul 12, 2014

bluss Jul 12, 2014

huonw Jul 12, 2014

kwantam commented Jul 11, 2014

bluss commented Jul 12, 2014

huonw Jul 12, 2014

kwantam Jul 12, 2014

huonw commented Jul 12, 2014

Kimundi Jul 12, 2014

kwantam Jul 12, 2014

kwantam Jul 13, 2014

kwantam commented Jul 13, 2014

kwantam commented Jul 14, 2014

huonw commented Jul 14, 2014

kwantam commented Jul 14, 2014

huonw Jul 14, 2014

huonw Jul 14, 2014

kwantam Jul 14, 2014

kwantam commented Jul 14, 2014

ariasuni commented Jun 14, 2017

bluss commented Aug 3, 2017 •

edited

Loading

mamciek commented Dec 28, 2017

UnicodeStrSlice: add width() and graphemes() methods #15619

UnicodeStrSlice: add width() and graphemes() methods #15619

Conversation

kwantam commented Jul 11, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kwantam commented Jul 11, 2014

bluss commented Jul 12, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

huonw commented Jul 12, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kwantam commented Jul 13, 2014

kwantam commented Jul 14, 2014

huonw commented Jul 14, 2014

kwantam commented Jul 14, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kwantam commented Jul 14, 2014

ariasuni commented Jun 14, 2017

bluss commented Aug 3, 2017 • edited Loading

mamciek commented Dec 28, 2017

bluss commented Aug 3, 2017 •

edited

Loading