-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UnicodeStrSlice: add width() and graphemes() methods #15619
Conversation
#[inline] | ||
fn words(&self) -> Words<'a> { | ||
self.split(u_char::is_whitespace).filter(|s| !s.is_empty()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as an aside, we note that this is not an UAX #29 algorithm for word boundaries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll add that to my list of things to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offhand, I'm inclined to say that either this should be a UAX #29 algorithm for word boundaries, or it should be removed entirely. It's easy enough for users to write that same words()
function themselves if they need it, and the existence of this function precludes adding the UAX #29 version (at least, without avoiding confusion down the road).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point.
On the other hand, the UAX#29 word breaking algorithm is a bit meaty, and often people don't need something that complicated. Maybe we should leave words() as is, and have a function with a different name that implements the UAX#29 algorithm? I don't want people to use words() and end up having a lot more complication than necessary.
Maybe this deserves its own issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can certainly understand why you don't want to implement the UAX#29 algorithm right now, and so I think you should just drop words()
entirely. It's easy for someone else to write the whitespace-splitting version if they decide that's what they want, but I don't think it's appropriate to call that "words" within the context of a unicode-focused library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
words()
is a separate issue. It was just moved in this commit, but it should be addressed (renamed or removed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #15628, but this behaviour of &str.words
has been around for a while, so can be addressed later.
Using |
/// let v: Vec<&str> = some_words.words().collect(); | ||
/// assert_eq!(v, vec!["Mary", "had", "a", "little", "lamb"]); | ||
/// ``` | ||
fn words(&self) -> Words<'a>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any particular reason that these functions are moving around? (Would it be possible to leave them where they were, to minimise the noise in the diff?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be. I just wanted to sort the functions in a more sensible order, because it seemed like they were pretty haphazardly thrown together. (...which is my fault; I did put them there, after all!)
I'll swap them back to their original order.
It seems that |
} | ||
|
||
impl<'a> UnicodeStrSlice<'a> for &'a str { | ||
#[inline] | ||
fn words(&self) -> Words<'a> { | ||
self.split(u_char::is_whitespace).filter(|s| !s.is_empty()) | ||
fn graphemes(&self, is_extended: bool) -> Graphemes<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to provide grapheme_indices()
as well, to complement char_indices()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I like that idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Now that I think about it, probably it would also be nice to have a |
OK, we have Is there a nicer way of reversing a static slice than |
You can just test equality of two iterators using |
👍 thanks |
/// # Example | ||
/// | ||
/// ```rust | ||
/// let gr1 = "a̐éö̲".graphemes(true).collect::<Vec<&str>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using explicit code-point notation (as I used previously) may be better here, because the interesting thing with this iterator is how it groups multiple codepoints into a single entity.
When written in short form, it's not clear that these aren't individual codepoints.
_ => self.catb.take_unwrap() | ||
}; | ||
|
||
state = match state { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, this table is effectively running http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules in reverse, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It implements the same splits, but running from the end of the string rather than the front.
- Graphemes and GraphemeIndices structs implement iterators over grapheme clusters analogous to the Chars and CharOffsets for chars in a string. Iterator and DoubleEndedIterator are available for both. - tidied up the exports for libunicode. crate root exports are now moved into more appropriate module locations: - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module - UnicodeChar exported from char instead of crate root - canonical_combining_class is exported from str rather than crate root Since libunicode's exports have changed, programs that previously relied on the old export locations will need to change their `use` statements to reflect the new ones. See above for more information on where the new exports live. closes #7043 [breaking-change]
@alexcrichton @huonw should be good to go now. Also realized that I needed |
- `width()` computes the displayed width of a string, ignoring the width of control characters. - arguably we might do *something* else for control characters, but the question is, what? - users who want to do something else can iterate over chars() - `graphemes()` returns a `Graphemes` struct, which implements an iterator over the grapheme clusters of a &str. - fully compliant with [UAX#29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) - passes all [Unicode-supplied tests](http://www.unicode.org/reports/tr41/tr41-15.html#Tests29) - added code to generate additionial categories in `unicode.py` - `Cn` aka `Not_Assigned` - categories necessary for grapheme cluster breaking - tidied up the exports from libunicode - all exports are exposed through a module rather than directly at crate root. - std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode closes #7043
How is that still Unstable? Should I open a new issue to stabilize this (at least |
@ariasuni It's not included even as experimental feature yet -- so the first step you can do is to get it into Rust as an unstable feature (write a PR, discuss on the forum or even write an RFC, which is probably not needed). |
FYI: It was removed from master |
width()
computes the displayed width of a string, ignoring the width of control characters.graphemes()
returns aGraphemes
struct, which implements an iterator over the grapheme clusters of a &str.unicode.py
Cn
akaNot_Assigned
closes #7043