Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeStrSlice: add width() and graphemes() methods #15619

Merged
merged 2 commits into from
Jul 16, 2014
Merged

UnicodeStrSlice: add width() and graphemes() methods #15619

merged 2 commits into from
Jul 16, 2014

Conversation

kwantam
Copy link
Contributor

@kwantam kwantam commented Jul 11, 2014

  • width() computes the displayed width of a string, ignoring the width of control characters.
    • arguably we might do something else for control characters, but the question is, what?
    • users who want to do something else can iterate over chars()
  • graphemes() returns a Graphemes struct, which implements an iterator over the grapheme clusters of a &str.
  • added code to generate additionial categories in unicode.py
    • Cn aka Not_Assigned
    • categories necessary for grapheme cluster breaking
  • tidied up the exports from libunicode
    • all exports are exposed through a module rather than directly at crate root.
    • std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

closes #7043

#[inline]
fn words(&self) -> Words<'a> {
self.split(u_char::is_whitespace).filter(|s| !s.is_empty())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as an aside, we note that this is not an UAX #29 algorithm for word boundaries

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll add that to my list of things to think about.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offhand, I'm inclined to say that either this should be a UAX #29 algorithm for word boundaries, or it should be removed entirely. It's easy enough for users to write that same words() function themselves if they need it, and the existence of this function precludes adding the UAX #29 version (at least, without avoiding confusion down the road).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point.

On the other hand, the UAX#29 word breaking algorithm is a bit meaty, and often people don't need something that complicated. Maybe we should leave words() as is, and have a function with a different name that implements the UAX#29 algorithm? I don't want people to use words() and end up having a lot more complication than necessary.

Maybe this deserves its own issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly understand why you don't want to implement the UAX#29 algorithm right now, and so I think you should just drop words() entirely. It's easy for someone else to write the whitespace-splitting version if they decide that's what they want, but I don't think it's appropriate to call that "words" within the context of a unicode-focused library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

words() is a separate issue. It was just moved in this commit, but it should be addressed (renamed or removed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #15628, but this behaviour of &str.words has been around for a while, so can be addressed later.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 11, 2014

@alexcrichton r?

@bluss
Copy link
Member

bluss commented Jul 12, 2014

Using char_indices() is a great change 👍

/// let v: Vec<&str> = some_words.words().collect();
/// assert_eq!(v, vec!["Mary", "had", "a", "little", "lamb"]);
/// ```
fn words(&self) -> Words<'a>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any particular reason that these functions are moving around? (Would it be possible to leave them where they were, to minimise the noise in the diff?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be. I just wanted to sort the functions in a more sensible order, because it seemed like they were pretty haphazardly thrown together. (...which is my fault; I did put them there, after all!)

I'll swap them back to their original order.

@huonw
Copy link
Member

huonw commented Jul 12, 2014

It seems that unicode.py has r = "unicode.rs", while it should be r = "tables.rs".

}

impl<'a> UnicodeStrSlice<'a> for &'a str {
#[inline]
fn words(&self) -> Words<'a> {
self.split(u_char::is_whitespace).filter(|s| !s.is_empty())
fn graphemes(&self, is_extended: bool) -> Graphemes<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to provide grapheme_indices() as well, to complement char_indices()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I like that idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 13, 2014

Now that I think about it, probably it would also be nice to have a DoubleEndedIterator implementation for Graphemes and GraphemeIndices. I'll implement that tonight.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 14, 2014

OK, we have DoubleEndedIterator for Graphemes and GraphemeIndices.

Is there a nicer way of reversing a static slice than slice.iter().rev().collect()? I'm not so worried about the efficiency of the test code, but that doesn't mean I want to be doing something boneheaded.

@huonw
Copy link
Member

huonw commented Jul 14, 2014

You can just test equality of two iterators using std::iter::order::equals and so avoid the collect entirely.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 14, 2014

👍 thanks

/// # Example
///
/// ```rust
/// let gr1 = "a̐éö̲".graphemes(true).collect::<Vec<&str>>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using explicit code-point notation (as I used previously) may be better here, because the interesting thing with this iterator is how it groups multiple codepoints into a single entity.

When written in short form, it's not clear that these aren't individual codepoints.

_ => self.catb.take_unwrap()
};

state = match state {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, this table is effectively running http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules in reverse, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It implements the same splits, but running from the end of the string rather than the front.

kwantam added 2 commits July 14, 2014 19:53
- Graphemes and GraphemeIndices structs implement iterators over
  grapheme clusters analogous to the Chars and CharOffsets for chars in
  a string. Iterator and DoubleEndedIterator are available for both.

- tidied up the exports for libunicode. crate root exports are now moved
  into more appropriate module locations:
  - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module
  - UnicodeChar exported from char instead of crate root
  - canonical_combining_class is exported from str rather than crate root

Since libunicode's exports have changed, programs that previously relied
on the old export locations will need to change their `use` statements
to reflect the new ones. See above for more information on where the new
exports live.

closes #7043
[breaking-change]
@kwantam
Copy link
Contributor Author

kwantam commented Jul 14, 2014

@alexcrichton @huonw should be good to go now.

Also realized that I needed [breaking-change] in the commit message because I tidied the export locations for libunicode, and programs that link directly to it would break. (Since it's pretty young, I doubt there are any such programs yet...)

bors added a commit that referenced this pull request Jul 15, 2014
- `width()` computes the displayed width of a string, ignoring the width of control characters.
    - arguably we might do *something* else for control characters, but the question is, what?
    - users who want to do something else can iterate over chars()

- `graphemes()` returns a `Graphemes` struct, which implements an iterator over the grapheme clusters of a &str.
    - fully compliant with [UAX#29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
    - passes all [Unicode-supplied tests](http://www.unicode.org/reports/tr41/tr41-15.html#Tests29)

- added code to generate additionial categories in `unicode.py`
    - `Cn` aka `Not_Assigned`
    - categories necessary for grapheme cluster breaking

- tidied up the exports from libunicode
  - all exports are exposed through a module rather than directly at crate root.
  - std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

closes #7043
@bors bors closed this Jul 16, 2014
@bors bors merged commit cf432b8 into rust-lang:master Jul 16, 2014
@ariasuni
Copy link
Contributor

How is that still Unstable? Should I open a new issue to stabilize this (at least graphemes()?)

@bluss
Copy link
Member

bluss commented Aug 3, 2017

@ariasuni It's not included even as experimental feature yet -- so the first step you can do is to get it into Rust as an unstable feature (write a PR, discuss on the forum or even write an RFC, which is probably not needed).

@mamciek
Copy link

mamciek commented Dec 28, 2017

FYI: It was removed from master
"Remove all unstable deprecated functionality"
8d90d3f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode grapheme support
8 participants