-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to TinyStr #7
Conversation
53a46e5
to
60c3bc0
Compare
@raphlinus - hi! would you have time to take a look at this and verify if I'm doing the right thing? I'm concerned about the unchecked perf and also about the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments inline.
Unchecked performance is an interesting question - the reason why it was good before is that you allowed Cow<'static, str>
so there's basically no work to be done. Since these are static strings, there is a potential strategy, which is to use the integers directly. One thing to explore is whether aggressive inlining of the TinyStr
constructor would let the compiler do this. If const fn
gets more powerful, then it might be possible to write the code carefully to be const fn.
Another possibility to explore, if identifiers from static sources are important, is a macro.
unic-langid-impl/src/tinystr.rs
Outdated
pub fn new_unchecked(text: &str) -> TinyStr8 { | ||
unsafe { | ||
let mut word: u64 = 0; | ||
copy_nonoverlapping(text.as_ptr(), &mut word as *mut u64 as *mut u8, text.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be possible to do this without unsafe, using a [u8; 8]
buffer and then u64::from_ne_bytes, which is present as of 1.32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I'm trying to do that, but I failed to find a way to get [u8; 8]
out of a &str
. What API should I use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, I'll actually go with dumping numbers and new_unchecked
will take it back.
unic-langid-impl/src/tinystr.rs
Outdated
fn cmp(&self, other: &TinyStr8) -> Ordering { | ||
//XXX: Is that correct? | ||
// I see that `nedis` < `macos` otherwise - see tests :( | ||
other.0.cmp(&self.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect what you want to do here is self.0.to_be().cmp(other.0.to_be())
.
unic-langid-impl/src/tinystr.rs
Outdated
assert!(TinyStr8::try_new("abcXYZ").unwrap().is_all_ascii_alpha()); | ||
} | ||
|
||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely needs more cases: strings of different lengths, etc.
unic-langid-impl/src/tinystr.rs
Outdated
pub struct TinyStr4(NonZeroU32); | ||
|
||
impl TinyStr8 { | ||
pub fn new_unchecked(text: &str) -> TinyStr8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unsafe as written, so should be pub unsafe fn
.
This is exactly for a macro - see https://github.com/zbraniecki/unic-locale/blob/master/unic-langid-macros-impl/src/lib.rs#L12 and https://github.com/zbraniecki/unic-locale/blob/master/unic-langid-macros/examples/langid-macro.rs I'd like to not make it slower if possible. I'm wondering if I could just have a method that returns the structure as TinyStr that can be passed to from_parts_unchecked just for the macro use. |
You should be able to do this by having a TinyStr constructor that takes an integer, and having your macro compute the integer for the string. Then performance would be spectacular. One thing to watch out for is endianness: I recommend the TinyStr constructor always take the string in little-endian, so do Also keep in mind, that constructor needs to be unsafe, because otherwise you can deref the resulting TinyStr and get an invalid string. |
I updated this patch to be based on #8. The unchecked performance is now on par with the previous solution! :) |
53b1a32
to
47f6efc
Compare
Per emilio recommendation I also switched variants from I could try to bring |
Wait, why do you need ToTokens on TinyStr? In the macro you can just expand to an invocation of new_unchecked, no? |
Yeah, but then I need to produce: let variants: Vec<_> = variants
.into_iter()
.map(|v| quote!($crate::TinyStr8::new_unchecked(#v)))
.collect();
let variants = quote!(Box::new([#(#variants,)*])); Am I wrong? |
You'll need to expose an |
Never mind there's already an Into impl |
What's wrong with that approach? |
When I try to do: static LANGID: LanguageIdentifier = langid!("en-US") I am getting an error: error[E0015]: calls in statics are limited to constant functions, tuple structs and tuple variants
--> unic-langid/examples/simple-langid.rs:6:37
|
6 | static LANGID: LanguageIdentifier = langid!("en-US"); |
Ah. That could be fixed if we used SmallVec as well I guess. I guess you still can't allocate in const? |
Anyway you can always put these in lazy_static |
Why would it be fixed? SmallVec constructor also is not const. |
@Manishearth - I think this is big enough to warrant a review on its own before I pile up more. The TinyStr crate is in review in a separate issue so this is only about the second commit. |
We can add one for constructing from arrays :) |
yes pls :) |
Is there a reason we're vendoring TinyStr? Is this the canonical location? May be worth maintaining in a separate repo |
No preference. If you think it's worth it, I can separate out. |
That would be ideal, thanks 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't review the tinystr code, but this looks good!
script: Option<Cow<'static, str>>, | ||
region: Option<Cow<'static, str>>, | ||
variants: Vec<Cow<'static, str>>, | ||
language: Option<TinyStr8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment saying that these are guaranteed to be <=4
81c1bef
to
80462ce
Compare
This is a port of https://github.com/projectfluent/fluent-locale-rs/pull/8/files to unic-langid.
It's a WIP patch, but it works and passes all tests. I think there are couple more optimizations possible and I'd also like to get feedback from the author on my attempt to separate
try_new
fromnew_unchecked
.Performance looks good, except of the unchecked:
This patch so far is backwards compatible which is super nice. But I'm now wondering if it should return
TinyStr
fromget_*
methods (which, I assume, would make the unchecked constructor cheaper/faster), but it would be a breaking change and would let people operate on TinyStr rather than str. Not sure if it's worth it...